Avoid reloading a PyTorch dataset

Question

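I train a CNN on a relatively stable combination of datasets, but every time I start a training job, the trainer waits 5-10 minutes to load my dataframes from disk. Is it possible to skip this step by loading the data only once and making it accessible to multiple processes?

Assume that I can start and tear down multiple training jobs and modify the training parameters and topology, but the data stays the same.

I think a client-server solution might be relevant, where one process loads the dataset once and then serves requests from multiple training clients. Has this been done before? Is there an existing framework that already implements it?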

Answer 1

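Hmm, I have never tried this before; I just wait patiently for the cached instance to load, haha. But I did find something by googling: try Redis. It is an in-memory data store with a Python client library (redis) that lets you keep things in RAM and retrieve them from other processes. I have never tried it myself, but here is some simple code to get it running: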

import pandas
import redis

df = pandas.read_csv('dataset.csv')
r = redis.Redis(host="localhost", port=8888, db=0)  # connect to a Redis server already listening on this port
for index, row in df.iterrows():
    r.hset(f"row:{index}", mapping=row.to_dict())  # store each row as a hash in Redis (RAM)

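Then, in another program (with redis imported as above):

r = redis.Redis(host="localhost", port=8888, db=0)  # connect to the same server
data = r.hgetall("row:0")  # or iterate over the row keys to read everything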
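
One thing to watch for: by default redis-py returns hash fields and values as bytes, and every value comes back as text rather than as the number it was stored from. Here is a minimal sketch of turning one row back into something usable. decode_row is a hypothetical helper, not part of redis, and the "anything that parses as a float is a number" rule is an assumption that may not suit every column.

```python
def decode_row(raw):
    """Decode the bytes mapping returned by hgetall into str keys and values.

    Values that parse as numbers become float; everything else stays str.
    (Illustrative helper, not part of the redis library.)
    """
    row = {}
    for key, value in raw.items():
        name = key.decode("utf-8")
        text = value.decode("utf-8")
        try:
            row[name] = float(text)
        except ValueError:
            row[name] = text
    return row
```

With the connection above this would be used as decode_row(r.hgetall("row:0")). Passing decode_responses=True to redis.Redis is another way to get str instead of bytes, although the values are still strings.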

There is also another implementation that uses multiprocessing shared memory. I copied this from GPT, but it looks legit to me.

import numpy as np
from multiprocessing import shared_memory

# Create some data (example: an array)
data = np.array([1, 2, 3, 4, 5], dtype=np.int32)

# Create shared memory block
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)

# Create a NumPy array using the shared memory buffer
shared_array = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)

# Copy data to shared memory
np.copyto(shared_array, data)

print("Shared memory block created with name:", shm.name)
print("Data written to shared memory:", shared_array)

# Keep this process running while other programs read the block.
# When it is no longer needed, call shm.close() and then shm.unlink() to free it.

In another program:

import numpy as np
from multiprocessing import shared_memory

# Replace 'your_shared_memory_name' with the name printed by the first script
shm_name = "your_shared_memory_name"  # Use the name from the first script

# Access the existing shared memory block
existing_shm = shared_memory.SharedMemory(name=shm_name)

# Create a NumPy array using the shared memory buffer
shared_array = np.ndarray((5,), dtype=np.int32, buffer=existing_shm.buf)

# Read the data from shared memory
print("Data read from shared memory:", shared_array)

# Clean up
existing_shm.close()
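
The reader above hardcodes the block name, the shape (5,), and the dtype. One way around that is for the loader to write those details to a small JSON file that every training job reads. This is a rough sketch that I have not run against a real dataset. publish_block, attach_block, and the JSON layout are names made up for illustration; only SharedMemory comes from the standard library.

```python
import json
from multiprocessing import shared_memory


def publish_block(payload, meta_path, shape, dtype):
    """Copy raw bytes into a new shared memory block and record how to find it."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    meta = {
        "name": shm.name,
        "nbytes": len(payload),
        "shape": list(shape),
        "dtype": dtype,
    }
    with open(meta_path, "w") as f:
        json.dump(meta, f)
    # The caller must keep this object alive, then call close() and unlink().
    return shm


def attach_block(meta_path):
    """Open the block described by the metadata file written by publish_block."""
    with open(meta_path) as f:
        meta = json.load(f)
    shm = shared_memory.SharedMemory(name=meta["name"])
    return shm, meta
```

The loader would call publish_block(data.tobytes(), "meta.json", data.shape, str(data.dtype)), and each training job would call attach_block("meta.json") and rebuild the array with np.ndarray(meta["shape"], dtype=meta["dtype"], buffer=shm.buf). The file name meta.json is just a placeholder.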

Again, this may solve your problem. Give it a try and let me know :D
