What is the best choice and best practice for iterating over a dataset?


So I am trying to iterate over some fundus images, but my ipynb file keeps crashing. There are only 3662 images; is there a more optimized way to iterate over my entire dataset? The only purpose here is to visualize and explore the data. Once the data is cleaned I will use a DataLoader, but for now I want to iterate over the whole dataset.

import os

import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm

#load in the train data - explore and clean
training_images=[]
missing_labels=[]
#count the sample distribution per class 
count={"No DR" :0 , "Mild DR": 0, "Moderate DR": 0, "Severe DR": 0, "Proliferative DR" : 0}



#iterate through the data in smaller subsets; np.array_split(df, 1000) yields
#1000 roughly equal chunks (~4 rows each for 3662 images)
for chunk in np.array_split(df, 1000):
    for idx, i in tqdm(chunk.iterrows(), total=len(chunk)):
        try:
            img_path = os.path.join(path, f"{i['id_code']}.png")
            img = cv2.imread(img_path, cv2.IMREAD_COLOR)
            label = i["diagnosis"]
        
            if pd.isna(label):
                #if any image within the csv file does not contain a label, we append the image to the missing_labels, in which we can later filter out
                missing_labels.append(np.array(img))

            #convert each image into a one hot encoding
            #np.eye() allows you define the number of output classes and the [] after it defines which class will be hot
            #np.eye(2)[0] for instance meaning [1,0], we have 2 classes and the 0th index is hot meaning that is our true label
            training_images.append([np.array(img),np.eye(5)[int(i["diagnosis"])]])    

        except Exception as e:
            print(f"There was an error processing image {img_path}: {e}")

print(training_images[0][0].shape)
print(training_images[0])        

#training_images is a list of [image, one-hot label] pairs, so the prints above show a numpy array and its one-hot encoded label
print(f"The number of missing labels found in the dataset is: {len(missing_labels)}")

python deep-learning pytorch neural-network dataset
1 Answer

The resolution of the images significantly affects memory usage. If the images are 1024x1024, the dataset needs about 11 GB of RAM, because images are decompressed when loaded into Python, which inflates the memory load. A minor point: cv2.imread() already returns a numpy array, so you do not need to call np.array() on it.

  1. Resize the images to a smaller resolution if possible (224x224, say)
  2. Avoid calling np.array() on the images
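That estimate is easy to sanity-check. A quick back-of-the-envelope calculation (3662 images is taken from the question; 3 uint8 channels assumed):

```python
# Rough RAM needed to hold every decoded image in memory at once.
n_images, channels = 3662, 3  # image count from the question; BGR uint8 assumed

full_res = n_images * 1024 * 1024 * channels  # bytes at 1024x1024
resized = n_images * 224 * 224 * channels     # bytes at 224x224

print(f"1024x1024: {full_res / 1024**3:.1f} GiB")  # 1024x1024: 10.7 GiB
print(f"224x224:   {resized / 1024**2:.0f} MiB")   # 224x224:   526 MiB
```

Resizing to 224x224 cuts the footprint by a factor of about 20, which is why the whole dataset then fits comfortably in a notebook session.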
import os

import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm

#load in the train data - explore and clean
training_images=[]
missing_labels=[]
#count the sample distribution per class 
count={"No DR" :0 , "Mild DR": 0, "Moderate DR": 0, "Severe DR": 0, "Proliferative DR" : 0}



#iterate through the data in smaller subsets; np.array_split(df, 1000) yields
#1000 roughly equal chunks (~4 rows each for 3662 images)
for chunk in np.array_split(df, 1000):
    for idx, i in tqdm(chunk.iterrows(), total=len(chunk)):
        try:
            img_path = os.path.join(path, f"{i['id_code']}.png")
            img = cv2.imread(img_path, cv2.IMREAD_COLOR)
            img = cv2.resize(img, (224, 224))
            label = i["diagnosis"]
        
            if pd.isna(label):
                #skip rows without a label; keep the image in missing_labels so it can be inspected later
                missing_labels.append(img)
                continue

            #convert each image into a one hot encoding
            #np.eye() allows you define the number of output classes and the [] after it defines which class will be hot
            #np.eye(2)[0] for instance meaning [1,0], we have 2 classes and the 0th index is hot meaning that is our true label
            training_images.append([img,np.eye(5)[int(label)]])    

        except Exception as e:
            print(f"There was an error processing image {img_path}: {e}")

print(training_images[0][0].shape)
print(training_images[0])        

#training_images is a list of [image, one-hot label] pairs, so the prints above show a numpy array and its one-hot encoded label
print(f"The number of missing labels found in the dataset is: {len(missing_labels)}")
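The np.eye indexing used for the labels above can be verified in isolation:

```python
import numpy as np

# Row k of the 5x5 identity matrix is the one-hot vector for class k.
print(np.eye(5)[2])           # [0. 0. 1. 0. 0.]
print(np.eye(5)[2].argmax())  # 2 -- the class index is recoverable via argmax
```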

As a last resort, load the images lazily into PyTorch by creating a custom `Dataset`.