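I want to train a pasta classifier and am currently preprocessing the data. I split it into training, validation, and test sets and saved it to my laptop with the following code: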
datasets.DatasetDict( { "train": train_dataset, "valid": valid_dataset, "test": test_dataset, } ).save_to_disk("split_pasta_dataset")
So far, so good. Now I have created a new Jupyter Notebook in which I will start building the model. But first I need to load the dataset from my laptop. I tried to do that like this:
dataset = datasets.load_from_disk('split_pasta_dataset')
I used this approach because it worked well on another classification project I did in the past. This time, however, it fails with the following error:
`---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 2
1 # Loading the dataset from the laptop
----> 2 dataset = datasets.load_from_disk('split_pasta_dataset')
File /opt/homebrew/anaconda3/envs/pasta_classifier/lib/python3.11/site-packages/datasets/load.py:1892, in load_from_disk(dataset_path, fs, keep_in_memory, storage_options)
1890 return Dataset.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, storage_options=storage_options)
1891 elif fs.isfile(path_join(dest_dataset_path, config.DATASETDICT_JSON_FILENAME)):
-> 1892 return DatasetDict.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, storage_options=storage_options)
1893 else:
1894 raise FileNotFoundError(
1895 f"Directory {dataset_path} is neither a Dataset directory nor a DatasetDict directory."
1896 )
File /opt/homebrew/anaconda3/envs/pasta_classifier/lib/python3.11/site-packages/datasets/dataset_dict.py:1319, in DatasetDict.load_from_disk(dataset_dict_path, fs, keep_in_memory, storage_options)
1313 for k in splits:
1314 dataset_dict_split_path = (
1315 dataset_dict_path.split("://")[0] + "://" + path_join(dest_dataset_dict_path, k)
1316 if is_remote_filesystem(fs)
1317 else path_join(dest_dataset_dict_path, k)
1318 )
-> 1319 dataset_dict[k] = Dataset.load_from_disk(
1320 dataset_dict_split_path, keep_in_memory=keep_in_memory, storage_options=storage_options
1321 )
1322 return dataset_dict
File /opt/homebrew/anaconda3/envs/pasta_classifier/lib/python3.11/site-packages/datasets/arrow_dataset.py:1627, in Dataset.load_from_disk(dataset_path, fs, keep_in_memory, storage_options)
1620 warnings.warn(
1621 "'fs' was deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0.\n"
1622 "You can remove this warning by passing 'storage_options=fs.storage_options' instead.",
1623 FutureWarning,
1624 )
1625 storage_options = fs.storage_options
-> 1627 fs_token_paths = fsspec.get_fs_token_paths(dataset_path, storage_options=storage_options)
1628 fs: fsspec.AbstractFileSystem = fs_token_paths[0]
1630 if is_remote_filesystem(fs):
File /opt/homebrew/anaconda3/envs/pasta_classifier/lib/python3.11/site-packages/fsspec/core.py:610, in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
608 if protocol:
609 storage_options["protocol"] = protocol
--> 610 chain = _un_chain(urlpath0, storage_options or {})
611 inkwargs = {}
612 # Reverse iterate the chain, creating a nested target_* structure
File /opt/homebrew/anaconda3/envs/pasta_classifier/lib/python3.11/site-packages/fsspec/core.py:325, in _un_chain(path, kwargs)
323 for bit in reversed(bits):
324 protocol = kwargs.pop("protocol", None) or split_protocol(bit)[0] or "file"
--> 325 cls = get_filesystem_class(protocol)
326 extra_kwargs = cls._get_kwargs_from_urls(bit)
327 kws = kwargs.pop(protocol, {})
File /opt/homebrew/anaconda3/envs/pasta_classifier/lib/python3.11/site-packages/fsspec/registry.py:232, in get_filesystem_class(protocol)
230 if protocol not in registry:
231 if protocol not in known_implementations:
--> 232 raise ValueError(f"Protocol not known: {protocol}")
233 bit = known_implementations[protocol]
234 try:
ValueError: Protocol not known: split_pasta_dataset`
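I tried restarting the kernel, clearing the outputs of all cells, and running the cell again, but got the same error. I also checked that I typed the exact directory name, and it appears to be correct. I searched for the error on Google as well but could not find a similar discussion. Can anyone help me solve this?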
This is an fsspec version issue.
Try pinning fsspec to an older release:
pip install fsspec==2023.6.0
Also see this issue on the Hugging Face datasets repository: https://github.com/huggingface/datasets/issues/6353
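Since the fix depends on which fsspec release is installed, it can help to confirm what the notebook's environment actually has, both before and after pinning. Here is a minimal sketch using only the standard library; the helper names installed_version and report_versions are my own for illustration and are not part of datasets or fsspec:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_versions(packages: Sequence[str] = ("datasets", "fsspec")) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Print one line per package so the versions can be pasted into a bug report.
    for name, ver in report_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it in the same kernel that raises the error. If fsspec still shows the old version after the pip install, the kernel probably needs a restart so the new version is picked up.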