为什么对于 RGB 到灰度,SIMD 只提高一点点性能,SIMD 乘法但向量元素的标量加法?

问题描述 投票:0回答:1

我正在学习如何使用SIMD进行图像处理。但是,我想知道为什么使用 SIMD 后性能没有太大改善。

  • 图片尺寸:3840*2160
  • 图像格式:PixelFormat.Format24bppRgb
  • 功能:24位到8位
  1. 正常方式与不安全和并行。平均:48ms
static void GrayViaParallel(BitmapData org, BitmapData des)
{
    int width = org.Width;
    int height = org.Height;

    var orgp = (byte*)org.Scan0.ToPointer();
    var desp = (byte*)des.Scan0.ToPointer();

    Parallel.For(0, height, i =>
    {
        int orgSd = i * org.Stride;
        int desSd = i * des.Stride;
        for (int j = 0; j < width; j++)
        {
            //                              Red                     Green                  Blue
            desp[desSd] = (byte)((orgp[orgSd + 2] * 19595 + orgp[orgSd + 1] * 38469 + orgp[orgSd] * 7472) >> 16);
            desSd++;
            orgSd += 3;
        }
    });
}
  1. 实现的SIMD:平均32ms
static void GrayViaParallelAndSIMD(byte* src, byte* dst, int count)
{
    var Coeleft = Vector128.Create(mulBlue, mulGreen, mulRed, mulBlue, mulGreen, mulRed, mulBlue, mulGreen);
    var CoeRight = Vector128.Create(mulRed, mulBlue, mulGreen, mulRed, mulBlue, mulGreen, mulRed, 0);

    int allPixels = count * 3;
    byte* srcEnd = src + allPixels; //Is it wrong?
    int stride = 15; //Proceed 15 bytes per step
    int loopCount = (int)((srcEnd - src) / stride);

    Parallel.For(0, loopCount, i =>
    {
        int curPos = (i + 1) * stride;
        if (curPos < allPixels) //If not added,  it will exceed the image data
        {
            // Load the first 16 bytes of the pixels
            var _1st16bytes = Sse2.LoadVector128(src + i * stride);

            // Get the first 8 bytes
            var low = Sse2.UnpackLow(_1st16bytes, Vector128<byte>.Zero).AsUInt16();
            //Get the next 8 bytes
            var high = Sse2.UnpackHigh(_1st16bytes, Vector128<byte>.Zero).AsUInt16();

            // Calculate the first 8 bytes
            var lowMul = Sse2.MultiplyHigh(Coeleft, low);
            // Calculate the next 8 bytes
            var highMul = Sse2.MultiplyHigh(CoeRight, high);

            //               Blue                     Green                   Red
            var px1 = lowMul.GetElement(0)  + lowMul.GetElement(1)  + lowMul.GetElement(2);
            var px2 = lowMul.GetElement(3)  + lowMul.GetElement(4)  + lowMul.GetElement(5);
            var px3 = lowMul.GetElement(6)  + lowMul.GetElement(7)  + highMul.GetElement(0);
            var px4 = highMul.GetElement(1) + highMul.GetElement(2) + highMul.GetElement(3);
            var px5 = highMul.GetElement(4) + highMul.GetElement(5) + highMul.GetElement(6);

            //15 bytes for 5 pixels 
            var i5 = i * 5;

            dst[i5    ] = (byte)px1;
            dst[i5 + 1] = (byte)px2;
            dst[i5 + 2] = (byte)px3;
            dst[i5 + 3] = (byte)px4;
            dst[i5 + 4] = (byte)px5;
        }
    });
}

请问有更好的方法吗?或者我该如何增强它? 如果有任何建议,我将非常感激。

我尝试过不同的 SIMD 方法,但似乎都效果不佳。我正在寻求效率的大幅提升。

c# simd sse unsafe parallel.for
1个回答
0
投票

GetElement 方法是标量操作,不具备 SIMD 硬件加速功能。为了获得更好的 SIMD 硬件加速,您需要尽可能避免标量运算。

对于24位到8位灰度转换,可以采用这种方式:从源位图中一次读取3个向量,进行3元组解交织操作,得到R、G、B平面数据。随后,使用矢量化乘法和加法来计算灰度值。结果是一个存储灰度值的 1-向量,可以将其存储到目标位图。

例如,SSE指令集使用128位向量,其中1个向量为16字节。从源位图中一次读取 3 个向量相当于读取 48 个字节,即 16 个 RGB 像素。最后,将一个向量存储到目标位图中意味着写入 16 个字节,即 16 个灰度像素。

对于去交错 3 元素组,可以使用 shuffle 类别的指令来完成。例如,对于X86架构中的128位向量,可以使用SSSE3中的_mm_shuffle_epi8指令,该指令对应于.NET中的

Ssse3.Shuffle
方法。源码如下

static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part0 = Vector128.Create((sbyte)0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part0 = Vector128.Create((sbyte)1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part0 = Vector128.Create((sbyte)2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15).AsByte();

public static Vector128<byte> YGroup3Unzip(Vector128<byte> data0, Vector128<byte> data1, Vector128<byte> data2, out Vector128<byte> y, out Vector128<byte> z) {
    var f0A = YGroup3Unzip_Shuffle_Byte_X_Part0;
    var f0B = YGroup3Unzip_Shuffle_Byte_X_Part1;
    var f0C = YGroup3Unzip_Shuffle_Byte_X_Part2;
    var f1A = YGroup3Unzip_Shuffle_Byte_Y_Part0;
    var f1B = YGroup3Unzip_Shuffle_Byte_Y_Part1;
    var f1C = YGroup3Unzip_Shuffle_Byte_Y_Part2;
    var f2A = YGroup3Unzip_Shuffle_Byte_Z_Part0;
    var f2B = YGroup3Unzip_Shuffle_Byte_Z_Part1;
    var f2C = YGroup3Unzip_Shuffle_Byte_Z_Part2;
    var rt0 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f0A), Ssse3.Shuffle(data1, f0B)), Ssse3.Shuffle(data2, f0C));
    var rt1 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f1A), Ssse3.Shuffle(data1, f1B)), Ssse3.Shuffle(data2, f1C));
    var rt2 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f2A), Ssse3.Shuffle(data1, f2B)), Ssse3.Shuffle(data2, f2C));
    y = rt1;
    z = rt2;
    return rt0;
}

为了更轻松地跨平台编写向量算法,我开发了VectorTraits库,它已经集成了上述算法。该库提供了方法 Vectors.YGroup3Unzip 方法。该方法是跨平台的,使用各个平台的shuffle指令。

  • X86 256 位:使用
    _mm256_shuffle_epi8
    和其他指令。
  • 手臂:使用
    vqvtbl1q_u8
    说明。
  • Wasm:使用
    i8x16.swizzle
    说明。

通过

Vectors.YGroup3Unzip
的方法,很容易写出24位转8位的灰度算法。灰度系数的精度为8位,因此需要将8位数据扩展为16位,然后进行乘法和加法计算。最后,16位数据被缩小为8位。源码如下

public static unsafe void UseVectorsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 3; // Bgr24
    const int shiftPoint = 8;
    const int mulPoint = 1 << shiftPoint; // 0x100
    const ushort mulRed = (ushort)(0.299 * mulPoint + 0.5); // 77
    const ushort mulGreen = (ushort)(0.587 * mulPoint + 0.5); // 150
    const ushort mulBlue = mulPoint - mulRed - mulGreen; // 29
    Vector<ushort> vmulRed = new Vector<ushort>(mulRed);
    Vector<ushort> vmulGreen = new Vector<ushort>(mulGreen);
    Vector<ushort> vmulBlue = new Vector<ushort>(mulBlue);
    int vectorWidth = Vector<byte>.Count;
    int maxX = width - vectorWidth;
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        Vector<byte>* pLast = (Vector<byte>*)(pRow + maxX * cbPixel);
        Vector<byte>* qLast = (Vector<byte>*)(qRow + maxX * 1);
        Vector<byte>* p = (Vector<byte>*)pRow;
        Vector<byte>* q = (Vector<byte>*)qRow;
        for (; ; ) {
            Vector<byte> r, g, b, gray;
            Vector<ushort> wr0, wr1, wg0, wg1, wb0, wb1;
            // Load.
            b = Vectors.YGroup3Unzip(p[0], p[1], p[2], out g, out r);
            // widen(r) * mulRed + widen(g) * mulGreen + widen(b) * mulBlue
            Vector.Widen(r, out wr0, out wr1);
            Vector.Widen(g, out wg0, out wg1);
            Vector.Widen(b, out wb0, out wb1);
            wr0 = Vectors.Multiply(wr0, vmulRed);
            wr1 = Vectors.Multiply(wr1, vmulRed);
            wg0 = Vectors.Multiply(wg0, vmulGreen);
            wg1 = Vectors.Multiply(wg1, vmulGreen);
            wb0 = Vectors.Multiply(wb0, vmulBlue);
            wb1 = Vectors.Multiply(wb1, vmulBlue);
            wr0 = Vector.Add(wr0, wg0);
            wr1 = Vector.Add(wr1, wg1);
            wr0 = Vector.Add(wr0, wb0);
            wr1 = Vector.Add(wr1, wb1);
            // Shift right and narrow.
            wr0 = Vectors.ShiftRightLogical_Const(wr0, shiftPoint);
            wr1 = Vectors.ShiftRightLogical_Const(wr1, shiftPoint);
            gray = Vector.Narrow(wr0, wr1);
            // Store.
            *q = gray;
            // Next.
            if (p >= pLast) break;
            p += cbPixel;
            ++q;
            if (p > pLast) p = pLast; // The last block is also use vector.
            if (q > qLast) q = qLast;
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

上面源码中的

Vectors.ShiftRightLogical_Const
是VectorTraits库提供的方法。它取代了
Vector.ShiftRightLogical
中新增的
.NET 7.0
方法,并允许早期版本的 .NET 使用逻辑右移。

Vectors.Multiply
也是VectorTraits库提供的方法。它避免了无符号类型有时无法硬件加速的问题。

然后为算法编写基准代码。

[Benchmark]
public void UseVectors() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, false);
}

[Benchmark]
public void UseVectorsParallel() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, true);
}

public static unsafe void UseVectorsDo(BitmapData src, BitmapData dst, bool useParallel = false) {
    int vectorWidth = Vector<byte>.Count;
    int width = src.Width;
    int height = src.Height;
    if (width <= vectorWidth) {
        ScalarDo(src, dst);
        return;
    }
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pSrc = (byte*)src.Scan0.ToPointer();
    byte* pDst = (byte*)dst.Scan0.ToPointer();
    int processorCount = Environment.ProcessorCount;
    int batchSize = height / (processorCount * 2);
    bool allowParallel = useParallel && (batchSize > 0) && (processorCount > 1);
    if (allowParallel) {
        int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)length / batchSize)
        Parallel.For(0, batchCount, i => {
            int start = batchSize * i;
            int len = batchSize;
            if (start + len > height) len = height - start;
            byte* pSrc2 = pSrc + start * strideSrc;
            byte* pDst2 = pDst + start * strideDst;
            UseVectorsDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst);
        });
    } else {
        UseVectorsDoBatch(pSrc, strideSrc, width, height, pDst, strideDst);
    }
}

完整源代码位于 Bgr24ToGray8Benchmark.cs

X86架构上的基准测试结果如下。

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method               | Width | Mean         | Error      | StdDev     | Ratio | RatioSD | Code Size |
|--------------------- |------ |-------------:|-----------:|-----------:|------:|--------:|----------:|
| Scalar               | 1024  |  1,028.55 us |  12.545 us |  11.735 us |  1.00 |    0.02 |     152 B |
| UseVectors           | 1024  |     94.06 us |   0.606 us |   0.537 us |  0.09 |    0.00 |        NA |
| UseVectorsParallel   | 1024  |     24.98 us |   0.390 us |   0.365 us |  0.02 |    0.00 |        NA |
| PeterParallelScalar  | 1024  |    216.47 us |   1.719 us |   1.524 us |  0.21 |    0.00 |        NA |
|                      |       |              |            |            |       |         |           |
| Scalar               | 2048  |  4,092.26 us |  21.098 us |  18.703 us |  1.00 |    0.01 |     152 B |
| UseVectors           | 2048  |    507.70 us |   9.626 us |  11.459 us |  0.12 |    0.00 |        NA |
| UseVectorsParallel   | 2048  |    118.98 us |   1.025 us |   0.959 us |  0.03 |    0.00 |        NA |
| PeterParallelScalar  | 2048  |    803.30 us |   9.226 us |   8.630 us |  0.20 |    0.00 |        NA |
|                      |       |              |            |            |       |         |           |
| Scalar               | 4096  | 16,391.12 us | 121.643 us | 113.785 us |  1.00 |    0.01 |     152 B |
| UseVectors           | 4096  |  2,472.16 us |  32.452 us |  30.356 us |  0.15 |    0.00 |        NA |
| UseVectorsParallel   | 4096  |  2,034.85 us |  33.074 us |  30.937 us |  0.12 |    0.00 |        NA |
| PeterParallelScalar  | 4096  |  3,139.85 us |  32.657 us |  27.270 us |  0.19 |    0.00 |        NA |
  • Scalar:标量算法。
  • UseVectors:向量算法。
  • UseVectorsParallel:并行向量算法。
  • PeterParallelScalar:Peter 编写的并行标量算法(GrayViaParallel)。

相同的源代码可以在Arm架构上运行。基准测试结果如下。

BenchmarkDotNet v0.14.0, macOS Sequoia 15.0.1 (24A348) [Darwin 24.0.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD [AttachedDebugger]
  DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD


| Method               | Width | Mean         | Error     | StdDev    | Ratio | RatioSD |
|--------------------- |------ |-------------:|----------:|----------:|------:|--------:|
| Scalar               | 1024  |    635.31 us |  0.537 us |  0.448 us |  1.00 |    0.00 |
| UseVectors           | 1024  |    127.04 us |  0.567 us |  0.474 us |  0.20 |    0.00 |
| UseVectorsParallel   | 1024  |     46.37 us |  0.336 us |  0.314 us |  0.07 |    0.00 |
| PeterParallelScalar  | 1024  |    202.19 us |  1.025 us |  0.959 us |  0.32 |    0.00 |
|                      |       |              |           |           |       |         |
| Scalar               | 2048  |  2,625.64 us |  1.795 us |  1.402 us |  1.00 |    0.00 |
| UseVectors           | 2048  |    521.40 us |  0.301 us |  0.282 us |  0.20 |    0.00 |
| UseVectorsParallel   | 2048  |    152.11 us |  3.548 us | 10.064 us |  0.06 |    0.00 |
| PeterParallelScalar  | 2048  |    711.00 us |  1.806 us |  1.601 us |  0.27 |    0.00 |
|                      |       |              |           |           |       |         |
| Scalar               | 4096  | 10,457.09 us |  5.697 us |  5.051 us |  1.00 |    0.00 |
| UseVectors           | 4096  |  2,058.16 us |  4.110 us |  3.643 us |  0.20 |    0.00 |
| UseVectorsParallel   | 4096  |  1,152.15 us | 21.134 us | 21.703 us |  0.11 |    0.00 |
| PeterParallelScalar  | 4096  |  2,897.94 us | 56.893 us | 91.871 us |  0.28 |    0.01 |
© www.soinside.com 2019 - 2024. All rights reserved.