I have a PyTorch Dataset subclass from which I create a PyTorch DataLoader. It works when I return two tensors from the Dataset's __getitem__() method. My attempt at a minimal (but non-working, more on that later) repro looks like this:
import torch
from torch.utils.data import Dataset
import random
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class DummyDataset(Dataset):
    def __init__(self, num_samples=3908, window=10):  # same default values as in the original code
        self.window = window
        # Create dummy data
        self.x = torch.randn(num_samples, 10, dtype=torch.float32, device='cpu')
        self.y = torch.randn(num_samples, 3, dtype=torch.float32, device='cpu')
        self.t = {i: random.choice([True, False]) for i in range(num_samples)}

    def __len__(self):
        return len(self.x) - self.window + 1

    def __getitem__(self, i):
        return self.x[i: i + self.window], self.y[i + self.window - 1]  #, self.t[i]

ds = DummyDataset()
dl = torch.utils.data.DataLoader(ds, batch_size=10, shuffle=False, generator=torch.Generator(device='cuda'), num_workers=4, prefetch_factor=16)

for data in dl:
    x = data[0]
    y = data[1]
    # t = data[2]
    print(f"x: {x.shape}, y: {y.shape}")  # , t: {t}
    break
The code above gives the following error:
RuntimeError: Expected a 'cpu' device type for generator but found 'cuda'
on the line for data in dl:.
But my original code does exactly the same thing as the code above: the dataset holds tensors created on the cpu, and the DataLoader's generator device is set to cuda, and it works (that is, the minimal code above does not work, but the very same lines in my original code do!).
When I try to return an additional boolean from the __getitem__() method by uncommenting , self.t[i], it gives me the following error:
Traceback (most recent call last):
  File "/my_project/src/train.py", line 66, in <module>
    trainer.train_validate()
  File "/my_project/src/trainer_cpu.py", line 146, in train_validate
    self.train()
  File "/my_project/src/trainer_cpu.py", line 296, in train
    for train_data in tqdm(self.train_dataloader, desc=">> train", mininterval=5):
  File "/usr/local/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.9/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 317, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 174, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 146, in collate
    return collate_fn_map[collate_type](batch, collate_fn_map=collate_fn_map)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 235, in collate_int_fn
    return torch.tensor(batch)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/_device.py", line 79, in __torch_function__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Why does this happen? Why doesn't it let me return an additional boolean from __getitem__()?
PS:
The above is the main question. However, I noticed something strange: if I change the DataLoader's generator device from cuda to cpu, the code above starts working (with or without , self.t[i] commented out)! That is, if I replace generator=torch.Generator(device='cuda') with generator=torch.Generator(device='cpu'), it outputs:
x: torch.Size([10, 10, 10]), y: torch.Size([10, 3])
If I do the same thing in my original code, I get the following error:
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
on the line for data in dl:.
UPDATE
When I changed the type of self.t from a Python dict to a torch tensor of dtype bool and moved it to the cpu, it started working:
self.t = torch.tensor([random.choice([True, False]) for _ in range(num_samples)], dtype=torch.bool).to('cpu')
Please explain why.
Use torch.Generator(device='cpu').
You should not do anything cuda-related inside the dataloader, especially when it runs multiple worker processes. Pull a batch out of the dataloader and move it to cuda after it has been collated.
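As a minimal sketch of that pattern (using a stand-in TensorDataset rather than your DummyDataset): everything the DataLoader touches stays on the CPU, and each collated batch is moved to the target device in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the original dataset: plain CPU tensors.
ds = TensorDataset(torch.randn(32, 10), torch.randn(32, 3))

# The generator only drives index sampling, so it must be a CPU generator.
dl = DataLoader(ds, batch_size=8, shuffle=True,
                generator=torch.Generator(device='cpu'))

for x, y in dl:
    # Move the already-collated batch to the device in the main process.
    x, y = x.to(device), y.to(device)
    break
```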
The dataloader's generator sets the RNG state used for sampling. It needs a CPU generator, which is why you get the error
RuntimeError: Expected a 'cpu' device type for generator but found 'cuda'
Cuda generators are for generating random numbers on the GPU inside a cuda process; they should not be used for dataloaders.
Cannot re-initialize CUDA in forked subprocess
is caused by attempting a cuda operation inside a forked process. It is hard to say without the original code, but it may be caused by __getitem__ returning cuda tensors.
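One mechanism consistent with your traceback, sketched below (a guess, not a definitive diagnosis): default_collate has to build a brand-new tensor out of plain Python bools via torch.tensor(batch), and tensor creation honors the process-wide default device. The torch/utils/_device.py frame in the traceback hints that the original code sets a cuda default device (e.g. with torch.set_default_device), so the forked worker touches CUDA during collation. Pre-built CPU tensors, as in your UPDATE, are merely stacked instead, which stays on the CPU.

```python
import torch
from torch.utils.data import default_collate

# A batch of plain Python bools, as returned by __getitem__ before the fix:
# default_collate constructs a new tensor with torch.tensor(batch), and that
# construction respects the default device -- this is the step that can
# initialize CUDA inside a forked worker.
bool_batch = [True, False, True]
collated = default_collate(bool_batch)

# A batch of pre-built CPU tensors, as in the UPDATE: default_collate just
# stacks them, so no new tensor is created on the default device.
tensor_batch = [torch.tensor(True), torch.tensor(False)]
stacked = default_collate(tensor_batch)
```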