我想在 ARM NEON 和 SSE 上对以下循环进行矢量化:
for (int i = 0; i < n; ++i) {
b[i][0] = 0.0;
for (int j = 1; j < n; ++j) {
b[i][j] = b[i][j - 1] + a[i][j];
}
}
该循环具有循环携带依赖性,因此不能简单地向量化。相反,可以使用外循环矢量化进行矢量化。
假设
a
和 b
是 float32,通过并行运行内循环的 4 个实例来实现外循环矢量化函数。以下是矢量化循环的迭代:
First iteration: { { i = 0, j = 1 }, { i = 1, j = 1 }, { i = 2, j = 1 }, { i = 3 , j = 1 } }
Second iteration: { { i = 0, j = 2 }, { i = 1, j = 2 }, { i = 2, j = 2 }, { i = 3 , j = 2 } }
Etc.
我想使用 NEON 和 SSE 向量化这个循环。它们都不支持聚集和分散,而这是简单矢量化此循环所必需的。您知道如何有效地矢量化这个循环吗?
代码中的算法称为“独占前缀和”。
我不确定SIMD会带来多少利润,但是,关于SSE,请尝试以下版本。微内核:
// Given a vector of [ a, b, c, d ], compute a vector of [ a, a+b, a+b+c, a+b+c+d ]
// The function requires SSE 4.1 support
__m128 inclusivePrefixSum( __m128 v )
{
// a*2, a+b, c*2, c+d
__m128 tmp = _mm_add_ps( v, _mm_moveldup_ps( v ) );
// a, a+b, c, c+d
tmp = _mm_blend_ps( tmp, v, 0b0101 );
// Broadcasted a+b
__m128 res = _mm_shuffle_ps( tmp, tmp, _MM_SHUFFLE( 1, 1, 1, 1 ) );
// a*2+b, (a+b)*2, a+b+c, a+b+c+d
res = _mm_add_ps( res, tmp );
// a, a+b, a+b+c, a+b+c+d
res = _mm_blend_ps( res, tmp, 0b0011 );
return res;
}
用途:
const size_t innerEndAligned = 1 + ( ( n - 1 ) / 4 ) * 4;
for( size_t i = 0; i < n; i++ )
{
float* const pa = a[ i ];
float* const pb = b[ i ];
__m128 acc = _mm_setzero_ps();
_mm_store_ss( pb, acc );
size_t j;
for( j = 1; j < innerEndAligned; j += 4 )
{
// Load 4 elements
__m128 vec = _mm_loadu_ps( pa + j );
// Compute prefix sum
vec = inclusivePrefixSum( vec );
// Increment by the accumulator
vec = _mm_add_ps( vec, acc );
// Store 4 output elements
_mm_storeu_ps( pb + j, vec );
// Broadcast the last lane from the vector,
// which contains cumulative sum of the row so far, into the accumulator
acc = _mm_shuffle_ps( vec, vec, _MM_SHUFFLE( 3, 3, 3, 3 ) );
}
// Handle the last 0-3 elements of the row with scalar code
for( ; j < n; j++ )
{
__m128 v = _mm_load_ss( pa + j );
acc = _mm_add_ss( acc, v );
_mm_store_ss( pb + j, acc );
}
}
我目前对这个问题很感兴趣,想问一下这个最终实现了吗?