I am trying to load a training dataset in my Google Colab notebook, but I keep getting an error. This only happens in Colab: when I run the same notebook in VS Code, it loads without any problem.
This is the code snippet that raises the error:
dataset_id ="nielsr/funsd-layoutlmv3"
from datasets import load_dataset
dataset = load_dataset(dataset_id)
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
In Colab, this returns:
Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...
---------------------------------------------------------------------------
UnidentifiedImageError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1569 _time = time.time()
-> 1570 for key, record in generator:
1571 if max_shard_size is not None and writer._num_bytes > max_shard_size:
9 frames
UnidentifiedImageError: cannot identify image file '/root/.cache/huggingface/datasets/downloads/extracted/e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/dataset/training_data/images/0000971160.png'
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1604 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
1605 e = e.__context__
-> 1606 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1607
1608 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
I expect to get the following output (which I do get successfully in VS Code):
Train dataset size: 149
Test dataset size: 50
Thanks in advance!
One solution to this problem can be found on the HuggingFace forums: try deleting the datasets cache folder at
~/.cache/huggingface/datasets
(HuggingFace's default cache location) and then downloading the dataset again. If that does not work, another solution may help (it was the only one that worked for me): look inside /root/.cache/huggingface/datasets/downloads/extracted/
where some of the downloaded datasets live, search for a .json file that references the dataset you are interested in, open it, and you will find the Google Drive URL of the dataset, from which you can download it manually.
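A minimal sketch of the first workaround, run from a Colab cell. It assumes the default cache location shown in the error message (adjust the path if HF_DATASETS_CACHE points elsewhere) and simply deletes the cache before re-downloading:

import shutil
from pathlib import Path

from datasets import load_dataset

# Default HuggingFace datasets cache in Colab (assumed path)
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"

# Remove the possibly corrupted cache so the files are downloaded fresh
if cache_dir.exists():
    shutil.rmtree(cache_dir)

# Re-run the original load; a fresh download is usually enough to clear the UnidentifiedImageError
dataset = load_dataset("nielsr/funsd-layoutlmv3")
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

If the error persists after a clean re-download, fall back to the second workaround above and fetch the files manually from the URL referenced in the extracted .json file.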