I'm trying to parallelize the simple update loop of a simulation on the GPU. Basically there are a bunch of "creatures" represented by circles that move in each update loop, and then there is a check whether any of them intersect. radii holds the radius for each of the different creature types.
import numpy as np
import math
from numba import cuda


@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32, uint32)')
def update(p_x, p_y, radii, types, velocities, max_velocities, acceleration, num_creatures, cycles):
    for c in range(cycles):
        for i in range(num_creatures):
            velocities[i] = velocities[i] + acceleration
            if velocities[i] > max_velocities[i]:
                velocities[i] = max_velocities[i]
            p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
            p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for i in range(num_creatures):
            for j in range(i, num_creatures):
                delta_x = p_x[j] - p_x[i]
                delta_y = p_y[j] - p_y[i]
                distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
                sum_of_radii = radii[types[i]] + radii[types[i]]
                if distance_squared < sum_of_radii * sum_of_radii:
                    pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = 16
update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities,
                                       acceleration, num_creatures, cycles)
print(device_p_x.copy_to_host())
The 1.0 in math.cos and math.sin is just a placeholder for each individual creature's direction.
As it stands there are multiple threads, but they all execute the same code. What changes do I have to make to the kernel to parallelize this?
The most obvious dimension for parallelization here seems to be the loop over i in your kernel, i.e. the iteration over num_creatures. So I'll describe how to do that.

1. Our strategy will be to have each iteration of the loop over num_creatures handled by a separate CUDA thread, rather than iterating serially. This is possible because the work done in each loop iteration is (mostly) independent - it does not depend on the results of other loop iterations (but see 2 below).

2. A challenge we run into is that the second for-loop over num_creatures presumably depends on the results of the first. If we leave everything as serial code running in a single thread, that dependency is taken care of by the nature of serial execution. But we want to parallelize it, so we need a global synchronization between the first for-loop over num_creatures and the second. A simple, convenient global sync in CUDA is the kernel launch, so we'll break the kernel code into two kernel functions, which we'll call update1 and update2.

3. This then presents the challenge of what to do about the over-arching loop over cycles. We can't simply replicate that loop in both kernels, because that would change the functional behavior - for example, we would compute cycles updates to p_x before computing a single delta_x calculation. That is presumably not what is wanted. So, for simplicity, we'll hoist this loop out of the kernel code back into host code. The host code will then call the update1 and update2 kernels for cycles iterations.

4. We also want to make the kernel processing adaptable to different sizes of num_creatures. So we'll pick a hard-coded size for threadsperblock, but make the number of blocks launched variable, based on the size of num_creatures. To facilitate this, we need a thread check (the initial if-statement) in each kernel, so that "extra" threads don't do anything.

With that description, we end up with something like this:
$ cat t11.py
import numpy as np
import math
from numba import cuda
@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, uint32)')
def update1(p_x, p_y, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])

@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], uint32)')
def update2(p_x, p_y, radii, types, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for i in range(cycles):
    update1[blockspergrid, threadsperblock](device_p_x, device_p_y, device_velocities, device_max_velocities, acceleration, num_creatures)
    update2[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, num_creatures)
print(device_p_x.copy_to_host())
$ python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
$
It produces the same output as the original posted version (the original code runs 16 blocks of 64 threads doing the same thing and stepping on each other as they write to the same data, so I'm referring to the original posted version but run with one block of one thread).
Note that there are certainly other ways to parallelize, and probably other optimizations possible, but this should give you a sensible starting point for CUDA work.
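As an aside, the grid sizing used in the host code, (num_creatures // threadsperblock) + 1, launches one extra (empty) block whenever num_creatures happens to be an exact multiple of threadsperblock. A common alternative is ceiling division; this small sketch (the helper name blocks_for is mine, not from the code above) shows the difference:

```python
def blocks_for(n, threads_per_block=64):
    # Ceiling division: the smallest block count whose total thread count
    # (blocks * threads_per_block) covers n elements.
    return (n + threads_per_block - 1) // threads_per_block

# The placement loop above creates 39 * 29 = 1131 creatures,
# which matches the 18 blocks of 64 threads seen in the profiler trace.
print(blocks_for(1131))       # 18
print(blocks_for(1024))       # 16; (1024 // 64) + 1 would launch 17
```

Either way the thread check inside the kernels makes the extra threads harmless; the ceiling form just avoids launching a wasted block.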
As you mentioned in your previous question, the second kernel here really isn't doing anything useful, but I assume that's just a placeholder for future work. I'm also not sure your use of radii here will give you what you want, but that's not the focus of this question either.
So what is the effect of all this, performance-wise? Again we'll compare the original posted version (t12.py, below) running with one block of one thread (not 16 blocks of 64 threads, which would only be worse anyway) against this version, which happens to run with 18 blocks of 64 threads (t11.py, below):
$ nvprof --print-gpu-trace python t11.py
==3551== NVPROF is profiling process 3551, command: python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
==3551== Profiling application: python t11.py
==3551== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
446.77ms 1.8240us - - - - - 7.8125KB 4.0847GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
446.97ms 1.7600us - - - - - 7.8125KB 4.2333GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.35ms 1.2160us - - - - - 12B 9.4113MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.74ms 1.3440us - - - - - 1.9531KB 1.3859GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
447.93ms 1.5040us - - - - - 5.8594KB 3.7154GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
448.13ms 1.5360us - - - - - 5.8594KB 3.6380GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
448.57ms 5.4720us (18 1 1) (64 1 1) 36 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update1$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, float, unsigned int) [49]
448.82ms 1.1200us (18 1 1) (64 1 1) 8 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update2$242(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<unsigned char, int=1, A, mutable, aligned>, unsigned int) [50]
448.90ms 2.1120us - - - - - 7.8125KB 3.5277GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ python t12.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
$ nvprof --print-gpu-trace python t12.py
==3604== NVPROF is profiling process 3604, command: python t12.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
==3604== Profiling application: python t12.py
==3604== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
296.22ms 1.8240us - - - - - 7.8125KB 4.0847GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
296.41ms 1.7920us - - - - - 7.8125KB 4.1577GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
296.79ms 1.2160us - - - - - 12B 9.4113MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.21ms 1.3440us - - - - - 1.9531KB 1.3859GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.40ms 1.5040us - - - - - 5.8594KB 3.7154GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
297.60ms 1.5360us - - - - - 5.8594KB 3.6380GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
298.05ms 1.8453ms (1 1 1) (1 1 1) 36 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::update$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<unsigned char, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>, float, unsigned int, unsigned int) [38]
299.91ms 2.1120us - - - - - 7.8125KB 3.5277GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that for the original t12.py version, the profiler reports a single update kernel running with 1 block and 1 thread, taking 1.8453 ms. For the modified t11.py version posted in this answer, the profiler reports 18 blocks of 64 threads each for the update1 and update2 kernels, and the combined execution time of these two kernels is approximately 5.47 + 1.12 = 6.59 microseconds.
EDIT: Based on some discussion in the comments, it should be possible to combine both kernels into a single kernel, using a double-buffering scheme on p_x and p_y:
$ cat t11.py
import numpy as np
import math
from numba import cuda
@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32)')
def update(p_x, p_y, p_x_new, p_y_new, radii, types, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        p_x_new[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y_new[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500000
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 2
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 80000 // spacing):
    for y in range(1, 6000 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_p_x_new = cuda.to_device(p_x)
device_p_y_new = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for i in range(cycles):
    if i % 2 == 0:
        update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_p_x_new, device_p_y_new, device_radii, device_types, device_velocities, device_max_velocities, acceleration, num_creatures)
    else:
        update[blockspergrid, threadsperblock](device_p_x_new, device_p_y_new, device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities, acceleration, num_creatures)
print(device_p_x_new.copy_to_host())
print(device_p_x.copy_to_host())
$ python t11.py
[ 20.05402946 20.05402946 20.05402946 ..., 0. 0. 0. ]
[ 20.1620903 20.1620903 20.1620903 ..., 0. 0. 0. ]
$
Since we still need the global sync provided by the kernel call, it is still necessary to keep the kernel-call loop over cycles in host code. But for a given number of cycles, this reduces the contribution of the kernel-call overhead.
When using this technique, care must be taken in the choice of cycles, and in whether data is read from the p_x or the p_x_new buffer, to get coherent results.
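For illustration only (this helper is mine, not part of the code above), the parity rule for picking the result buffer can be written down directly. Iteration i of the host loop reads (p_x, p_y) and writes (p_x_new, p_y_new) when i is even, and the reverse when i is odd; the last iteration is i = cycles - 1:

```python
def result_buffer_name(cycles):
    # After `cycles` launches of the double-buffered kernel, the freshest
    # positions live in the "_new" buffers exactly when the last iteration
    # index (cycles - 1) is even, i.e. when cycles is odd.
    if cycles <= 0:
        return "p_x"   # no launches: the original buffer is still current
    return "p_x_new" if cycles % 2 == 1 else "p_x"

print(result_buffer_name(1))   # p_x_new
print(result_buffer_name(2))   # p_x  (matches the cycles = 2 run above,
                               #       where p_x holds the later values)
```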