Using huggingface load_dataset in a Google Colab notebook

Problem description

I am trying to load the training dataset in my Google Colab notebook, but I keep getting an error. This only happens in Colab: when I run the same notebook in VS Code, the dataset loads without any problem.

Here is the code snippet that returns the error:

from datasets import load_dataset

dataset_id = "nielsr/funsd-layoutlmv3"

dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

And, in Colab, it returns:

Downloading and preparing dataset funsd-layoutlmv3/funsd to /root/.cache/huggingface/datasets/nielsr___funsd-layoutlmv3/funsd/1.0.0/0e3f4efdfd59aa1c3b4952c517894f7b1fc4d75c12ef01bcc8626a69e41c1bb9...
---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1569                 _time = time.time()
-> 1570                 for key, record in generator:
   1571                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

9 frames
UnidentifiedImageError: cannot identify image file '/root/.cache/huggingface/datasets/downloads/extracted/e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/dataset/training_data/images/0000971160.png'

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1604             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1605                 e = e.__context__
-> 1606             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1607 
   1608         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

I was expecting the following output (which I do get successfully in VS Code):

Train dataset size: 149
Test dataset size: 50

Thank you in advance!

dataset google-colaboratory huggingface-datasets
2 Answers

2 votes

Most likely the file got corrupted during the download. Try this:

from datasets import load_dataset

dataset = load_dataset("nielsr/funsd-layoutlmv3", download_mode="force_redownload")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

On Colab, it should then print the expected train and test split sizes (149 and 50).
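As a side note (not part of the original answer): if you want to confirm that the cached image named in the traceback really is unreadable before forcing a re-download, a minimal sketch along these lines can check it. PIL's Image.open/verify only parse the file, they do not repair anything.

# Optional check: confirm the cached image really is corrupted before re-downloading.
# The path below is the exact file named in the traceback above.
from PIL import Image, UnidentifiedImageError

path = (
    "/root/.cache/huggingface/datasets/downloads/extracted/"
    "e5bbbc543f8cc95554da124f3e80a57ed24d67d06ae1467da5810703f851e3f9/"
    "dataset/training_data/images/0000971160.png"
)

try:
    with Image.open(path) as img:
        img.verify()  # parses the file headers; raises if the file is corrupted
    print("Image looks fine")
except (UnidentifiedImageError, OSError) as err:
    print(f"Image is corrupted or truncated: {err}")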


0 votes

One solution to this problem can be found on the HuggingFace forums: try deleting the .cache folder at ~/.cache/huggingface/datasets (the default HuggingFace cache location), and then download the dataset again. If that does not work, another solution (the only one that worked for me) is this: you should be able to see some of the extracted dataset files under the /root/.cache/huggingface/datasets/downloads/extracted/ directory. Search there for a .json file that references the dataset you are interested in, open it, and you will find the Google Drive URL that points directly to the dataset, which you can then download manually.
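For completeness, clearing the cache from a Colab cell could look roughly like the sketch below (a minimal sketch assuming the default cache path mentioned above; it simply deletes the folder so the next load_dataset call starts from scratch).

import os
import shutil

from datasets import load_dataset

# Default HuggingFace datasets cache directory (as mentioned above).
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")

# Remove any cached / partially downloaded data so the next call starts clean.
shutil.rmtree(cache_dir, ignore_errors=True)

dataset = load_dataset("nielsr/funsd-layoutlmv3")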
