What is the best choice and best practice for iterating over a dataset?


So I am trying to iterate over some fundus images, but my ipynb file keeps crashing. There are only 3662 images; is there a more optimized way to iterate over my entire dataset? The only purpose here is to visualize and explore the data. Once the data is cleaned I will use a DataLoader, but for now I want to iterate over the whole dataset.

import os

import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm

#load in the train data - explore and clean
training_images=[]
missing_labels=[]
#count the sample distribution per class 
count={"No DR" :0 , "Mild DR": 0, "Moderate DR": 0, "Severe DR": 0, "Proliferative DR" : 0}



#iterate through the data in smaller subsets; np.array_split(df, 1000) yields
#1000 roughly equal chunks (~4 rows each for 3662 images)
for chunk in np.array_split(df, 1000):
    for idx, i in tqdm(chunk.iterrows(), total=len(chunk)):
        try:
            img_path = os.path.join(path, f"{i['id_code']}.png")
            img = cv2.imread(img_path, cv2.IMREAD_COLOR)
            label = i["diagnosis"]
        
            if pd.isna(label):
                #if any image within the csv file does not contain a label, we append the image to the missing_labels, in which we can later filter out
                missing_labels.append(np.array(img))

            #convert each image into a one hot encoding
            #np.eye() allows you define the number of output classes and the [] after it defines which class will be hot
            #np.eye(2)[0] for instance meaning [1,0], we have 2 classes and the 0th index is hot meaning that is our true label
            training_images.append([np.array(img),np.eye(5)[int(i["diagnosis"])]])    

        except Exception as e:
            print(f"There was an error processing image {img_path}: {e}")

print(training_images[0][0].shape)
print(training_images[0])        

#training_images is a list of [image, one-hot label] pairs, so the prints above show a numpy array and its one-hot encoded label
print(f"The number of missing labels found in the dataset is: {len(missing_labels)}")

python deep-learning pytorch neural-network dataset
1 Answer

The resolution of the images significantly affects memory usage. If the images are 1024x1024, the dataset needs about 11 GB of RAM, because images are decompressed when loaded into Python, which inflates the memory load. A minor point: cv2.imread() already returns a numpy array, so you do not need to call np.array() on it.

  1. Resize the images to a smaller resolution if possible (224x224, say)
  2. Avoid calling np.array() on the images
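That estimate is easy to sanity-check. A quick back-of-the-envelope calculation (3662 images is taken from the question; 3 uint8 channels assumed):

```python
# Rough RAM needed to hold every decoded image in memory at once.
n_images, channels = 3662, 3  # image count from the question; BGR uint8 assumed

full_res = n_images * 1024 * 1024 * channels  # bytes at 1024x1024
resized = n_images * 224 * 224 * channels     # bytes at 224x224

print(f"1024x1024: {full_res / 1024**3:.1f} GiB")  # 1024x1024: 10.7 GiB
print(f"224x224:   {resized / 1024**2:.0f} MiB")   # 224x224:   526 MiB
```

Resizing to 224x224 cuts the footprint by a factor of about 20, which is why the whole dataset then fits comfortably in a notebook session.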
import os

import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm

#load in the train data - explore and clean
training_images=[]
missing_labels=[]
#count the sample distribution per class 
count={"No DR" :0 , "Mild DR": 0, "Moderate DR": 0, "Severe DR": 0, "Proliferative DR" : 0}



#iterate through the data in smaller subsets; np.array_split(df, 1000) yields
#1000 roughly equal chunks (~4 rows each for 3662 images)
for chunk in np.array_split(df, 1000):
    for idx, i in tqdm(chunk.iterrows(), total=len(chunk)):
        try:
            img_path = os.path.join(path, f"{i['id_code']}.png")
            img = cv2.imread(img_path, cv2.IMREAD_COLOR)
            img = cv2.resize(img, (224, 224))
            label = i["diagnosis"]
        
            if pd.isna(label):
                #skip rows without a label; keep the image in missing_labels so it can be inspected later
                missing_labels.append(img)
                continue

            #convert each image into a one hot encoding
            #np.eye() allows you define the number of output classes and the [] after it defines which class will be hot
            #np.eye(2)[0] for instance meaning [1,0], we have 2 classes and the 0th index is hot meaning that is our true label
            training_images.append([img,np.eye(5)[int(label)]])    

        except Exception as e:
            print(f"There was an error processing image {img_path}: {e}")

print(training_images[0][0].shape)
print(training_images[0])        

#training_images is a list of [image, one-hot label] pairs, so the prints above show a numpy array and its one-hot encoded label
print(f"The number of missing labels found in the dataset is: {len(missing_labels)}")
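The np.eye indexing used for the labels above can be verified in isolation:

```python
import numpy as np

# Row k of the 5x5 identity matrix is the one-hot vector for class k.
print(np.eye(5)[2])           # [0. 0. 1. 0. 0.]
print(np.eye(5)[2].argmax())  # 2 -- the class index is recoverable via argmax
```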

As a last resort, load the images lazily into PyTorch by creating a custom `Dataset`.