Creating a progress bar in Python with Numba and CUDA

Question

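I'm running a parallel process with Numba and CUDA (on Windows) that will take quite a while. It would be nice to print an updating progress bar in the console so I can see the progress across all threads. Something like tqdm would be absolutely perfect, but for CUDA.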

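I've tried tqdm and numba-progress, but neither seems to work with CUDA. I've also tried my own class-based solution, but unfortunately you can't pass a class into a kernel function (I think). I found this thread, which describes the same problem I want to solve, but it has no replies. All the other posts I've found are not about CUDA.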

Here is some skeleton code that I'd like to add a progress bar to:

from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil

        
# This function is just an example of what i'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number): 
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
            
        return index
    
    # Ideally increment a global counter of some kind here, or have a module that does it for me
    
    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id))

number_of_samples = 10000000

threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)

generate_samples[blocks, threads_per_block](rng_states, out, 5)
    
print('Average Sample:', out.mean())

Any help would be greatly appreciated!

Answer 1

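You can use a Numba CUDA mapped_array for this. Under the hood, this tells Numba to create a pinned allocation and make it available for use on the device. It also tells Numba not to copy the array to the device, even though a pinned_array would normally look like a host array to Numba.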

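We also need to make sure Numba does not try to copy any arrays, because in the "automatic" case a copy forces a synchronization, which we don't want. This is why the scripts below pass a device array (out_d) for the output instead of the host array out.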

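I don't really know how to measure the progress of that algorithm. For example, the while loop in poission_sample (as it is spelled in the code) seems to iterate 4 times for the item whose thread_id is zero, but I doubt that holds across the whole out array. (I do have a better idea of how to monitor the progress of other algorithms.)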

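If we know, in terms of progress, how far the algorithm has to go, then we can simply monitor the value reported by the kernel. When it reaches 100% (or close to it), we stop monitoring and carry on with the rest of the work.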

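For demonstration purposes, I'll arbitrarily decide that the progress of this algorithm is measured by the number of threads that have finished their work.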
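To make the "threads finished" measure concrete, here is a minimal sketch in plain Python of turning a raw counter into a percentage and a one-line text bar. The helper names percent_done and render_bar are hypothetical and not part of Numba. The launch geometry is copied from the question, and the counter value is made up for illustration.

```python
from math import ceil

def percent_done(completed, total):
    """Return completion as a percentage, clamped to the range [0, 100]."""
    if total <= 0:
        raise ValueError("total must be positive")
    return max(0.0, min(100.0, 100.0 * completed / total))

def render_bar(pct, width=40):
    """Render a fixed-width text bar followed by the percentage."""
    filled = int(width * pct / 100.0)
    return "[" + "#" * filled + "." * (width - filled) + "] %5.1f%%" % pct

# Same launch geometry as the question: the grid is rounded up to whole blocks,
# so slightly more threads run than there are samples.
number_of_samples = 10000000
threads_per_block = 512
blocks = ceil(number_of_samples / threads_per_block)
total_threads = threads_per_block * blocks

# Illustrative counter value: pretend half of the threads have reported in.
print(render_bar(percent_done(total_threads // 2, total_threads)))
```

The scripts in this answer divide the counter by (threads_per_block * blocks) // 100 instead, so the value at completion comes out marginally above 100. Dividing by the exact thread count and clamping, as sketched here, is one way to avoid that.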

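When we can't determine completion from the progress the kernel reports (as in your case, at least as far as I can tell), the alternative is to keep monitoring and reporting progress until an event signals that the kernel has finished.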
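The shape of that "poll until a completion signal arrives" loop can be sketched without a GPU. In this hedged example a host thread stands in for the kernel and a threading.Event stands in for the CUDA event. The names monitor and fake_kernel are hypothetical. With Numba, read_progress would read progress[0] and is_done would be the event's query method.

```python
import threading
import time

def monitor(read_progress, is_done, total, step_pct=10, poll_s=0.001):
    """Poll a counter until is_done() is true; return the percentages reported."""
    reported = [0]
    while not is_done():
        pct = 100 * read_progress() // total
        if pct >= reported[-1] + step_pct:
            reported.append(pct)
            print("%3d%%" % pct)
        time.sleep(poll_s)  # avoid spinning the host CPU at full speed
    return reported

# Stand-in for the kernel: a host thread that bumps a shared counter.
total = 200
counter = [0]
finished = threading.Event()

def fake_kernel():
    for _ in range(total):
        counter[0] += 1
        time.sleep(0.0005)
    finished.set()  # plays the role of the recorded event completing

worker = threading.Thread(target=fake_kernel)
worker.start()
reported = monitor(lambda: counter[0], finished.is_set, total)
worker.join()
```

Because the loop ends on the completion signal and not on a threshold, it terminates even if the counter never reaches the expected total.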

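Anyway, the following works for me on Linux, as a rough sketch. It demonstrates the use of an event, although if you know the progress of the algorithm, the event isn't really needed. Here is the version with an event: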

$ cat t1.py
from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil


# This function is just an example of what i'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate, progress):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number, progress):
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
        cuda.atomic.add(progress, 0, 1)
        return index

    # Ideally increment a global counter of some kind here, or have a module that does it for me

    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id), progress)

number_of_samples = 10000000
progress = cuda.mapped_array(1, dtype=np.int64)
progress[0] = 0
last_pct = 0
my_e = cuda.event()
threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)
my_divisor = (threads_per_block * blocks) // 100
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
out_d = cuda.device_array_like(out)
generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
my_e.record()
print(last_pct)
while not my_e.query():
    cur_pct = progress[0]/my_divisor
    if cur_pct > last_pct + 10:
        last_pct = cur_pct
        print(cur_pct)
out = out_d.copy_to_host()

print('Average Sample:', out.mean())
$ python3 t1.py
0
10.00129996100117
20.00291991240263
30.004539863804087
40.00519984400468
50.00713978580642
60.00811975640731
70.00941971740848
80.01039968800936
90.01105966820995
Average Sample: 5.000568
$

Here is a version without an event:

$ cat t2.py
from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil


# This function is just an example of what i'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate, progress):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number, progress):
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
        cuda.atomic.add(progress, 0, 1)
        return index

    # Ideally increment a global counter of some kind here, or have a module that does it for me

    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id), progress)

number_of_samples = 10000000
progress = cuda.mapped_array(1, dtype=np.int64)
progress[0] = 0
last_pct = 0
threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)
my_divisor = (threads_per_block * blocks) // 100
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
out_d = cuda.device_array_like(out)
generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
print(last_pct)
while last_pct < 90:
    cur_pct = progress[0]/my_divisor
    if cur_pct > last_pct + 10:
        last_pct = cur_pct
        print(cur_pct)
out = out_d.copy_to_host()

print('Average Sample:', out.mean())
$ python3 t2.py
0
10.000019999400019
20.000039998800037
30.000059998200054
40.000079997600075
50.00009999700009
60.00011999640011
70.00013999580013
80.00015999520015
90.00017999460016
Average Sample: 5.000568
$
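One caveat with a threshold-based loop like the one in t2.py is that it never exits if the counter stalls, for example because the kernel failed. A minimal sketch of a timeout guard follows, in plain Python. The names wait_for_count and FakeCounter are hypothetical, and FakeCounter only imitates reads of progress[0].

```python
import time

def wait_for_count(read_progress, target, timeout_s=5.0, poll_s=0.001):
    """Poll until read_progress() >= target; return False if timeout_s passes first."""
    deadline = time.monotonic() + timeout_s
    while read_progress() < target:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True

class FakeCounter:
    """Hypothetical stand-in for progress[0]: advances by `step` on every read."""
    def __init__(self, step):
        self.value = 0
        self.step = step

    def read(self):
        self.value += self.step
        return self.value

advancing = FakeCounter(step=10)
stalled = FakeCounter(step=0)
print(wait_for_count(advancing.read, target=100))
print(wait_for_count(stalled.read, target=100, timeout_s=0.05))
```

The timeout value is a judgment call that depends on how long the kernel is expected to run, so treat the default here as a placeholder.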

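I ran both of these on Linux. The version without an event might work better on Windows, or it might be the other way around (the event query may help push the work submission along). If you are on Windows with a display GPU (i.e. the GPU is not in TCC mode), WDDM work batching/scheduling may cause problems. You can try both settings of Windows Hardware-Accelerated GPU Scheduling to see whether one works better than the other.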

Also, this kernel runs in less than a second on my GPU (the kernel duration is in fact about 300 ms on my GTX 970). So it may not be a very interesting test case.
