I'm following a tutorial to fine-tune a model, but I'm stuck on a load_dataset error I can't resolve. For context, the tutorial first uploads this dataset to HF, and I then successfully uploaded an identical one of my own.
However, when I run the script to download the dataset, something goes wrong. If I download the original dataset, the process runs smoothly and all files are fetched correctly. But when I try to download mine, only part of it seems to come down (in the error message you can see the path up to the 0.0.0 folder, but nothing after that).
The command I run is
dataset = load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2", split="train")
and the error log I get is as follows:
Downloading data files: 100%|████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|████████████████████████████████████████| 3/3 [00:00<00:00, 198.67it/s]
Traceback (most recent call last):
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1852, in _prepare_split_single
writer = writer_class(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
self.stream = self._fs.open(fs_token_paths[2][0], "wb")
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\spec.py", line 1241, in open
f = self._open(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 184, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 315, in __init__
self._open()
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 320, in _open
self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Felipe Bandeira/.cache/huggingface/datasets/FelipeBandeiraPoatek___parquet/FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 11, in <module>
main()
File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 7, in main
dataset_tester.test("FelipeBandeiraPoatek/invoices-donut-data-v2")
File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\tools\donut\dataset_tester.py", line 10, in test
dataset = load_dataset(dataset_name, split="train")
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\load.py", line 1782, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1749, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1892, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
I haven't found any solution yet, and I can't figure out why the original dataset downloads fine while my (identical) one does not. Any clues?
(Things I've tried:
In every case, I can download the dataset correctly from the original repository but not from my own. The same error keeps occurring.)
I ran into this problem too, and it eventually turned out that the filename was too long and exceeded the system's path-length limit. Making the filename shorter fixed it!
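This matches the classic Windows 260-character MAX_PATH limit. As a quick diagnostic sketch, you can measure the length of the path from the traceback yourself (the `NNNNN` placeholder is kept verbatim from the log; the exact resolved name would only make the path longer):

```python
# Diagnostic sketch: check whether the cache path from the traceback
# exceeds the classic Windows MAX_PATH limit of 260 characters.
failing_path = (
    "C:/Users/Felipe Bandeira/.cache/huggingface/datasets/"
    "FelipeBandeiraPoatek___parquet/"
    "FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/"
    "2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/"
    "parquet-validation-00000-00000-of-NNNNN.arrow"
)

MAX_PATH = 260  # classic Windows limit, unless long paths are enabled
print(len(failing_path))             # length of the failing path
print(len(failing_path) > MAX_PATH)  # over the limit
```

If length is indeed the culprit, two workarounds come to mind: point the cache at a shorter root, e.g. `load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2", split="train", cache_dir="C:/hf")` (the `cache_dir` parameter is part of the `datasets` API), or enable long paths in Windows via the `LongPathsEnabled` registry setting.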