What is the ARMv7 to ARMv8 NEON port of this matrix multiplication code?

Question

// http://infocenter.arm.com/help/topic/com.arm.doc.dai0425/DAI0425_migrating_an_application_from_ARMv5_to_ARMv7_AR.pdf
// p. 4-21

.macro mul_col_f32 res_q, col0_d, col1_d
vmul.f32 \res_q, q8, \col0_d[0] @ multiply col element 0 by matrix col 0
vmla.f32 \res_q, q9, \col0_d[1] @ multiply-acc col element 1 by matrix col 1
vmla.f32 \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2
vmla.f32 \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3
.endm

// http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100748_0606_00_en/lmi1470147220260.html
// http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0203j/Cacjfjei.html

.globl  mat44mulneon
.p2align 2 // what's this ?
.type mat44mulneon,%function
mat44mulneon:
.fnstart // not recognized by eclipse syntax coloring?
// ---------
vld1.32 {d16-d19}, [r1]! @ load first eight elements of matrix 0
vld1.32 {d20-d23}, [r1]! @ load second eight elements of matrix 0
vld1.32 {d0-d3}, [r2]! @ load first eight elements of matrix 1.
vld1.32 {d4-d7}, [r2]! @ load second eight elements of matrix 1.
mul_col_f32 q12, d0, d1 @ matrix 0 * matrix 1 col 0
mul_col_f32 q13, d2, d3 @ matrix 0 * matrix 1 col 1
mul_col_f32 q14, d4, d5 @ matrix 0 * matrix 1 col 2
mul_col_f32 q15, d6, d7 @ matrix 0 * matrix 1 col 3
vst1.32 {d24-d27}, [r0]! @ store first eight elements of result.
vst1.32 {d28-d31}, [r0]! @ store second eight elements of result.
// ---------
bx lr // Return by branching to the address in the link register.
.fnend

The code above, which I found on the ARM site (see the links in the comments), runs on my ARM Cortex-A9 machine, i.e. an ARMv7 machine.

I am now trying to get it to run on an ARMv8 / aarch64 CPU. I found this slide deck: porting to ARM64

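At the end, it shows matrix multiplication code. That code uses a loop, though, and I think (please correct me if I am wrong) that the code I posted would be faster if ported to the new ARMv8 mnemonics. The linked document also shows some v7 -> v8 changes; for example, I replaced things like vmul.f32 with fmul, and so on. However, the register names given in the example do not match the register names in the code posted above. Since I am not familiar with ARM asm at all, I do not know what the equivalents are. For example, when I build my project, I get an error such as: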

operand 1 must be a SIMD vector register list -- `st1 {d24-d27},[r0]

I am not sure this is the only problem, so I would rather ask: what changes are needed for this code to run on an aarch64 machine?

Answer 1

Here is a rough AArch64 version of the routine:

.macro mul_col_f32 res, col
    fmul \res, v16.4s, \col[0] // multiply col element 0 by matrix col 0
    fmla \res, v17.4s, \col[1] // multiply-acc col element 1 by matrix col 1
    fmla \res, v18.4s, \col[2] // multiply-acc col element 2 by matrix col 2
    fmla \res, v19.4s, \col[3] // multiply-acc col element 3 by matrix col 3
.endm

.globl  mat44mulneon
mat44mulneon:
    ld1 {v16.4s, v17.4s, v18.4s, v19.4s}, [x1] // load all 16 elements of matrix 0
    ld1 {v0.4s,  v1.4s,  v2.4s,  v3.4s},  [x2] // load all 16 elements of matrix 1
    mul_col_f32 v24.4s, v0.s // matrix 0 * matrix 1 col 0
    mul_col_f32 v25.4s, v1.s // matrix 0 * matrix 1 col 1
    mul_col_f32 v26.4s, v2.s // matrix 0 * matrix 1 col 2
    mul_col_f32 v27.4s, v3.s // matrix 0 * matrix 1 col 3
    st1 {v24.4s, v25.4s, v26.4s, v27.4s}, [x0] // store all 16 elements of the result
    ret

Beyond the general points mentioned in the linked presentation, here are some non-exhaustive notes on the conversion:

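You can load up to 64 bytes with a single ld1 instruction, versus 32 bytes with vld1 in AArch32. A 4x4 matrix of 32-bit floats is exactly 64 bytes, so one ld1 covers a whole matrix and there is no need to increment the r0 / r1 / r2 (now x0 / x1 / x2) pointers.
I left out the OS/binary-format-specific directives .fnstart, .fnend and .type; if needed, they can be re-added in the same places as in the original version.
In AArch64 assembly, @ is no longer a comment character.
The col argument of mul_col_f32 has the form v0.s, as opposed to v0.4s. When selecting a specific lane, which is what happens once the argument is concatenated with the [0] suffix in the macro, the number of lanes should be omitted; e.g. to select the first lane of the v0.4s register, write v0.s[0]. The GNU assembler accepts v0.4s[0], but other assemblers (the Clang/LLVM built-in assembler and Microsoft's armasm64) only accept the former syntax.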
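
To check a port like this, it can help to compare the assembly routine against a plain C reference. The following is a minimal sketch, not part of the original answer: the name mat44mul_ref is hypothetical, and it assumes the same argument order as the routine (result in x0, matrix 0 in x1, matrix 1 in x2) and column-major storage of 16 floats per matrix, as the comments in the assembly suggest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference for mat44mulneon (an assumption, not from the
 * original post). Matrices are 16 floats in column-major order, so the element
 * at row r, column c lives at index c * 4 + r. The result must not alias the
 * inputs in this sketch. */
static void mat44mul_ref(float *res, const float *m0, const float *m1)
{
    assert(res != NULL && m0 != NULL && m1 != NULL);

    for (int c = 0; c < 4; ++c) {       /* one mul_col_f32 per column of m1 */
        for (int r = 0; r < 4; ++r) {   /* one SIMD lane per row */
            float acc = m0[0 * 4 + r] * m1[c * 4 + 0]; /* fmul: m0 col 0 * lane 0 */
            acc += m0[1 * 4 + r] * m1[c * 4 + 1];      /* fmla: m0 col 1 * lane 1 */
            acc += m0[2 * 4 + r] * m1[c * 4 + 2];      /* fmla: m0 col 2 * lane 2 */
            acc += m0[3 * 4 + r] * m1[c * 4 + 3];      /* fmla: m0 col 3 * lane 3 */
            res[c * 4 + r] = acc;
        }
    }
}
```

Calling mat44mul_ref(expected, m0, m1) and mat44mulneon(actual, m0, m1) on the same inputs and comparing the 16 outputs should show whether the register mapping is right. Because fmla is a fused multiply-add, the results may differ from the scalar version in the last bits, so comparing with a small tolerance is safer than testing for exact equality.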
