Why does each of my DataLoader worker processes use as much as 2.6 GB of virtual memory, and is there a way to reduce it?
Each DataLoader worker process occupies 2.6 GB of virtual memory, so 4 workers occupy 10.4 GB in total.
from transformers import AutoModelForZeroShotImageClassification, AutoProcessor
from torch.utils.data import DataLoader
...
dataset = ImageDataset(image, clip_processor, SCAN_PROCESS_BATCH_SIZE, test_times)
dataloader = DataLoader(dataset, batch_size=SCAN_PROCESS_BATCH_SIZE, num_workers=4)
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, image, processor, scan_process_batch_size, test_times):
        self.image = image
        self.processor = processor
        self.scan_process_batch_size = scan_process_batch_size
        self.test_times = test_times

    def __len__(self):
        return self.scan_process_batch_size * self.test_times

    def __getitem__(self, idx):
        # Use the processor to preprocess the (single, repeated) image;
        # pixel_values has shape (1, C, H, W) since a one-element list is passed
        inputs = self.processor(
            images=[self.image],
            return_tensors="pt",
            padding=True
        )['pixel_values']
        return inputs
I inspected the process with Process Hacker: in the memory tab, the private column contains many 32 MB regions. I dumped one of them and found that the entire 32 MB was filled with zeros.
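For reference, the same numbers can be cross-checked programmatically. This is a sketch I added (not from the original post); it assumes psutil is installed and that it runs in the parent process while the DataLoader workers are alive:

import psutil

# List the virtual memory size (VMS) of this process and all of its
# children; with num_workers=4 the children are the DataLoader workers.
me = psutil.Process()
for p in [me] + me.children(recursive=True):
    print(p.pid, f"{p.memory_info().vms / 2**30:.2f} GiB virtual")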
I also noticed that virtual memory increases after this code (from PyTorch's torch/utils/data/dataloader.py, where the worker processes are spawned) runs:
for i in range(self._num_workers):
    # No certainty which module multiprocessing_context is
    index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
    # Need to `cancel_join_thread` here!
    # See sections (2) and (3b) above.
    index_queue.cancel_join_thread()
    w = multiprocessing_context.Process(
        target=_utils.worker._worker_loop,
        args=(self._dataset_kind, self._dataset, index_queue,
              self._worker_result_queue, self._workers_done_event,
              self._auto_collation, self._collate_fn, self._drop_last,
              self._base_seed, self._worker_init_fn, i, self._num_workers,
              self._persistent_workers, self._shared_seed))
    w.daemon = True
    # NB: Process.start() actually take some time as it needs to
    #     start a process and pass the arguments over via a pipe.
    #     Therefore, we only add a worker to self._workers list after
    #     it started, so that we do not call .join() if program dies
    #     before it starts, and __del__ tries to join but will get:
    #     AssertionError: can only join a started process.
    w.start()
    self._index_queues.append(index_queue)
    self._workers.append(w)
I would expect reducing num_workers to reduce memory usage; for example, num_workers=1 should use only a quarter of the memory. Keep in mind that this also reduces parallelism, since you will be using fewer CPU cores/threads.
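For example, the same DataLoader call from the question with fewer workers (assuming worker memory scales roughly linearly with the worker count):

# One worker instead of four: worker memory should drop to roughly a
# quarter, at the cost of loading batches with less parallelism.
dataloader = DataLoader(dataset, batch_size=SCAN_PROCESS_BATCH_SIZE, num_workers=1)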
Another factor is the batch size. The snippet passes batch_size=SCAN_PROCESS_BATCH_SIZE; I cannot see where that value is declared, but reducing it will reduce the number of images loaded into memory at once.
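As a sketch, with a hypothetical value for SCAN_PROCESS_BATCH_SIZE (the question does not show the real declaration):

SCAN_PROCESS_BATCH_SIZE = 8  # hypothetical; smaller means fewer images held per batch
dataloader = DataLoader(dataset, batch_size=SCAN_PROCESS_BATCH_SIZE, num_workers=4)

Note that in this particular Dataset the same constant also feeds __len__, so changing it shrinks the dataset length as well.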
Without seeing the rest of the code and the inputs, it is hard to suggest further reductions. One thing worth checking is the size of the input images and whether it can be reduced (for example with a downscaling image transform).
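A rough sketch of that last point, assuming image is a PIL Image (the question does not show how it is loaded); thumbnail() only ever downscales, so it is harmless if the image is already small:

from PIL import Image

MAX_SIDE = 336  # hypothetical target; pick the smallest size your model tolerates
image.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)  # in-place, keeps aspect ratio
dataset = ImageDataset(image, clip_processor, SCAN_PROCESS_BATCH_SIZE, test_times)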