Looping over arrays with inline assembly

Question

When looping over an array with inline assembly, should I use the register constraint "r" or the memory constraint "m"?

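Let's consider an example which adds two float arrays x and y and writes the result to z. Normally I would use intrinsics to do this, like this: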

for(int i=0; i<n/4; i++) {
    __m128 x4 = _mm_load_ps(&x[4*i]);
    __m128 y4 = _mm_load_ps(&y[4*i]);
    __m128 s = _mm_add_ps(x4,y4);
    _mm_store_ps(&z[4*i], s);
}

Here is the inline assembly solution I came up with using the register constraint "r":

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

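This produces assembly similar to what GCC generates for the intrinsics version. The main difference is that GCC adds 16 to the index register and uses a scale of 1, whereas the inline assembly solution adds 4 to the index register and uses a scale of 4.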

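I was not able to use an arbitrary general-purpose register for the iterator. I had to name a specific one, which in this case was rax. Is there a reason for this?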

Here is a solution I came up with which uses the memory constraint "m":

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

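This is less efficient because it does not use an index register; instead it has to add 16 to the base register of each array. The generated assembly is (gcc (Ubuntu 5.2.1-22ubuntu2), compiled with gcc -O3 -S asmtest.c):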

.L22:
    movaps   (%rsi), %xmm0
    addps    (%rdi), %xmm0
    movaps   %xmm0, (%rdx)
    addl    $4, %eax
    addq    $16, %rdx
    addq    $16, %rsi
    addq    $16, %rdi
    cmpl    %eax, %ecx
    ja      .L22

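Is there a better solution using the memory constraint "m"? Is there some way to get it to use an index register? The reason I ask is that using the memory constraint "m" seems more logical to me, since I am reading and writing memory. Additionally, with the register constraint "r" I never use the output operand list, which seems odd to me.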

Maybe there is a better solution than using either "r" or "m"?

Here is the full code I used to test this:

#include <stdio.h>
#include <x86intrin.h>

#define N 64

void add_intrin(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 x4 = _mm_load_ps(&x[i]);
        __m128 y4 = _mm_load_ps(&y[i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[i], s);
    }
}

void add_intrin2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i++) {
        __m128 x4 = _mm_load_ps(&x[4*i]);
        __m128 y4 = _mm_load_ps(&y[4*i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[4*i], s);
    }
}

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

int main(void) {
    float x[N], y[N], z1[N], z2[N], z3[N];
    for(int i=0; i<N; i++) x[i] = 1.0f, y[i] = 2.0f;
    add_intrin2(x,y,z1,N);
    add_asm1(x,y,z2,N);
    add_asm2(x,y,z3,N);
    for(int i=0; i<N; i++) printf("%.0f ", z1[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z2[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z3[i]); puts("");
}

Answer 1

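Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so that you can unroll and tune it manually, instead of doing it like this.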

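You can use an r constraint for the index. Use the q modifier to get the name of the 64-bit register, so that you can use it in an addressing mode. When compiling for 32-bit targets, the q modifier selects the name of the 32-bit register, so the same code still works.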

If you want to choose which addressing mode is used, you need to do it yourself, using pointer operands with r constraints.

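GNU C inline asm syntax does not assume that you read or write the memory pointed to by pointer operands. (For example, you might be using an inline-asm and on the pointer value itself.) So you need either a "memory" clobber or memory input/output operands to let the compiler know which memory you modify. A "memory" clobber is easy, but it forces everything except locals to be spilled and reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.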
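As a minimal sketch of the "memory" clobber option (x86 with GCC or Clang is assumed, and the helper names store_via_asm and store_then_read are made up for illustration, not taken from this answer): the asm only receives the pointer in a register, so the clobber is what tells the compiler that x may have changed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: stores `value` through `p` from inside the asm.
   The compiler only sees a pointer in a register, not the store itself. */
static inline void store_via_asm(int *p, int value)
{
    assert(p != NULL);
    __asm__ __volatile__ (
        "movl %[v], (%[p])"
        :                                /* no output operands */
        : [p] "r" (p), [v] "r" (value)
        : "memory"                       /* the asm writes memory the compiler cannot see */
    );
}

/* Reads x back after the asm store; the clobber forces a real reload. */
int store_then_read(void)
{
    int x = 0;
    store_via_asm(&x, 42);
    return x;
}
```

Without the "memory" clobber (or a memory output operand), the compiler would be free to assume x is still 0 after the call is inlined.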

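Specifically, "m" (*(const float (*)[]) fptr) tells the compiler that the entire array object, of arbitrary length, is an input. That is, the asm cannot be reordered with any stores that use fptr as part of the address (or that use the array it is known to point into). This also works with an "=m" or "+m" constraint (without the const, obviously).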
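A small sketch of that whole-array dummy input, assuming x86 with SSE and GCC or Clang. The function name sum_first_two is hypothetical. The asm addresses memory through the pointer register; the unnamed "m" operand is never referenced in the template and exists only to describe the dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns p[0] + p[1], with both loads done inside the asm. */
static inline float sum_first_two(const float *p)
{
    float result;
    assert(p != NULL);
    __asm__ (
        "movss   (%[p]), %[r]\n\t"       /* r  = p[0] */
        "addss  4(%[p]), %[r]"           /* r += p[1] */
        : [r] "=x" (result)
        : [p] "r" (p),
          "m" (*(const float (*)[]) p)   /* dummy input: the whole array behind p */
    );
    return result;
}
```

Because the asm has an output and its memory input is declared, it does not need to be volatile: earlier stores to the array stay ordered before it, and the statement can be dropped if the result is unused.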

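Using a specific size such as "m" (*(const float (*)[4]) fptr) lets you tell the compiler exactly what you do and do not read (or write). It can then (if otherwise permitted) sink a store to a later element past the asm statement and combine it with another store, or do dead-store elimination on any stores that your inline asm does not read.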
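The fixed-size form could look roughly like this for one 4-float chunk. This is a sketch under assumptions: add4_asm is a hypothetical name, x86 with SSE is assumed, and movups is used so the example does not depend on alignment (the rest of this answer uses movaps).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: z[0..3] = x[0..3] + y[0..3] using unaligned accesses. */
static inline void add4_asm(const float *x, const float *y, float *z)
{
    __m128 a, b;   /* scratch registers chosen by the compiler */
    assert(x != NULL && y != NULL && z != NULL);
    __asm__ (
        "movups  (%[x]), %[a]\n\t"
        "movups  (%[y]), %[b]\n\t"
        "addps   %[b], %[a]\n\t"
        "movups  %[a], (%[z])"
        : [a] "=x" (a), [b] "=x" (b),
          "=m" (*(float (*)[4]) z)           /* exactly 4 floats are written */
        : [x] "r" (x), [y] "r" (y), [z] "r" (z),
          "m" (*(const float (*)[4]) x),     /* exactly 4 floats are read from x */
          "m" (*(const float (*)[4]) y)      /* ... and 4 from y */
    );
}
```

With these operands the compiler knows that elements outside the three 16-byte windows are untouched, so stores to, say, x[4] are free to move across the asm statement.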

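Another huge benefit of an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or so, because every source-level value of i needs to appear in a register.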

Here is my version, with some tweaks as noted in the comments.

#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"
          // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}

See the Godbolt compiler explorer asm output for this and for a couple of the versions below.

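Your version needs to declare %xmm0 as clobbered, or you will have a bad time when the function is inlined. My version uses a temporary variable as an output-only operand that is never used. This gives the compiler full freedom for register allocation.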
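For comparison, a minimal repair of the question's hard-coded-%xmm0 version might look like the sketch below. The name add_asm1_clobbers is hypothetical, and the signature differs from the question's on purpose: a size_t index is pointer-width, so the plain %[idx] operand already prints a full-width register name and no fixed rax is needed. It assumes 16-byte-aligned pointers and n divisible by 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumes x, y, z are 16-byte aligned and n is a multiple of 4. */
void add_asm1_clobbers(const float *x, const float *y, float *z, size_t n)
{
    assert(((uintptr_t)x | (uintptr_t)y | (uintptr_t)z) % 16 == 0);
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __asm__ __volatile__ (
            "movaps  (%[y],%[idx],4), %%xmm0\n\t"
            "addps   (%[x],%[idx],4), %%xmm0\n\t"
            "movaps  %%xmm0, (%[z],%[idx],4)"
            :                                /* no output operands */
            : [x] "r" (x), [y] "r" (y), [z] "r" (z), [idx] "r" (i)
            : "xmm0",                        /* the asm overwrites this register */
              "memory"                       /* the asm reads x/y and writes z */
        );
    }
}
```

This keeps the question's structure but is still less flexible than letting the compiler pick the scratch register through a dummy "=x" output.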

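If you want to avoid the "memory" clobber, you can use dummy memory input/output operands such as "m" (*(const __m128*)&x[i]) to tell the compiler which memory your asm reads and writes. This is necessary to ensure correct code generation if you did something like x[4] = 1.0; right before running that loop. (Even if you did not write something that simple, inlining and constant propagation can boil your code down to it.) It is also needed to make sure the compiler does not read from z[] before the loop has run.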

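In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed ones. It does not know that the inline asm never actually references those memory operands through the addressing mode created by the constraint!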

# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    addq    $16, %r10       #, ivtmp.19
    addq    $16, %r9        #, ivtmp.21
    addq    $16, %r8        #, ivtmp.22
    cmpl    %eax, %ecx      # i, n
    ja      .L11        #,

r8, r9 and r10 are the extra pointers that the inline asm block does not use.

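You can use a constraint that tells gcc that an entire array of arbitrary length is an input or an output: "m" (*(const struct {char a; char x[];} *) pStr), from @David Wohlferd's answer on an asm strlen. Since we want to use indexed addressing modes, we will have the base addresses of all three arrays in registers, and this form of constraint asks for the base address as an operand, rather than a pointer to the memory currently being operated on.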

This actually works without any extra counter increments inside the loop:

void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                             float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
             , "=m" (*(struct {float a; float x[];} *) z)
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
              , "m" (*(const struct {float a; float x[];} *) x),
                "m" (*(const struct {float a; float x[];} *) y)
        );
    }
}

This gives us the same inner loop that we got with a "memory" clobber:

.L19:   # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    cmpl    %eax, %ecx      # i, n
    ja      .L19        #,

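It tells the compiler that each asm block reads or writes the entire arrays, so it may unnecessarily stop the compiler from interleaving the asm with other code (for example, after fully unrolling a loop with a low iteration count). It does not stop unrolling, but the requirement to have each index value in a register does make unrolling less effective.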

A version with m constraints, which gcc can unroll:

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
           // "movaps   %[yi], %[vectmp]\n\t"
            "addps    %[xi], %[vectmp]\n\t"  // We requested that the %[yi] input be in the same register as the [vectmp] dummy output
            "movaps   %[vectmp], %[zi]\n\t"
          // ugly ugly type-punning casts; __m128 is a may_alias type so it's safe.
            : [vectmp] "=x" (vectmp), [zi] "=m" (*(__m128*)&z[i])
            : [yi] "0"  (*(__m128*)&y[i])  // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
            , [xi] "xm" (*(__m128*)&x[i])
            :  // memory clobber not needed
        );
    }
}

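Using [yi] as a +x input/output operand would be simpler, but writing it this way makes for a smaller change when uncommenting the load in the inline asm, instead of letting the compiler get one value into a register for us.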
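For reference, the simpler "+x" variant mentioned here might look like the following sketch. It is not the code from this answer: the name add_asm_plusx is hypothetical, a size_t loop counter is used, and 16-byte-aligned pointers with n divisible by 4 are assumed. The compiler loads y[i] into a register of its choice, and the asm adds x[i] and stores the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Assumes x, y, z are 16-byte aligned and n is a multiple of 4. */
void add_asm_plusx(const float *x, const float *y, float *z, size_t n)
{
    assert(((uintptr_t)x | (uintptr_t)y | (uintptr_t)z) % 16 == 0);
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 acc = *(const __m128 *)&y[i];      /* the compiler emits this load */
        __asm__ (
            "addps   %[xi], %[acc]\n\t"           /* acc += x[i..i+3] */
            "movaps  %[acc], %[zi]"               /* z[i..i+3] = acc  */
            : [acc] "+x" (acc),                   /* read-write register operand */
              [zi] "=m" (*(__m128 *)&z[i])
            : [xi] "xm" (*(const __m128 *)&x[i])  /* register or aligned memory */
        );
    }
}
```

Since every memory access is described by an operand, no "memory" clobber or volatile is needed, and the compiler stays free to pick addressing modes and to unroll.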

Answer 2

When I compile your add_asm2 code with gcc (4.9.2) I get:

add_asm2:
.LFB0:
        .cfi_startproc
        xorl        %eax, %eax
        xorl        %r8d, %r8d
        testl       %ecx, %ecx
        je  .L1
        .p2align 4,,10
        .p2align 3
.L5:
#APP
# 3 "add_asm2.c" 1
        movaps   (%rsi,%rax), %xmm0
addps    (%rdi,%rax), %xmm0
movaps   %xmm0, (%rdx,%rax)

# 0 "" 2
#NO_APP
        addl        $4, %r8d
        addq        $16, %rax
        cmpl        %r8d, %ecx
        ja  .L5
.L1:
        rep; ret
        .cfi_endproc

So it is not perfect (it uses a redundant register), but it does use indexed loads...

Answer 3

gcc also has builtin vector extensions, which are even cross-platform:

typedef float v4sf __attribute__((vector_size(16)));
void add_vector(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i+=1) {
        *(v4sf*)(z + 4*i) = *(v4sf*)(x + 4*i) + *(v4sf*)(y + 4*i);
    }
}

With my gcc version 4.7.2, the generated assembly is:

.L28:
        movaps  (%rdi,%rax), %xmm0
        addps   (%rsi,%rax), %xmm0
        movaps  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rcx, %rax
        jne     .L28
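The add_vector loop above relies on n being a multiple of 4 and on the pointers being suitably aligned for the v4sf casts. A hedged variant that drops both assumptions could go through memcpy and finish with a scalar tail. The function name add_vector_any is made up for this sketch, and whether the compiler turns each memcpy into a single unaligned vector move depends on the target and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));

/* z[i] = x[i] + y[i] for any n and any alignment. */
void add_vector_any(const float *x, const float *y, float *z, size_t n)
{
    size_t i = 0;
    assert(x != NULL && y != NULL && z != NULL);
    for (; i + 4 <= n; i += 4) {
        v4sf a, b, s;
        memcpy(&a, x + i, sizeof a);   /* alignment-safe vector load */
        memcpy(&b, y + i, sizeof b);
        s = a + b;                     /* element-wise add via the vector extension */
        memcpy(z + i, &s, sizeof s);   /* alignment-safe vector store */
    }
    for (; i < n; i++)                 /* scalar tail for the last n % 4 elements */
        z[i] = x[i] + y[i];
}
```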
