I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:
pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --local-dir-use-symlinks False
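However, the downloaded files don't have their original filenames. Instead, their hashes (git-sha or sha256, depending on whether they are LFS files) are used as filenames: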
--- /home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/blobs ---------------------------------------------
/..
12.9 GiB [##########] b581945ddee5e673fa2059afb25274b1523f270687b5253cb8aa72865760ebc0
3.9 GiB [### ] 86ebd2861a42b27168d75f346dd72f0e2b9eaee0afb90890beff15d025af45c6
3.9 GiB [## ] f9b81739ee30450b930390e1155e2cdea1b3063379ba6fd9253513eba1ab1e05
3.7 GiB [## ] e54c7d123ad93f4144eebdca2827ef81ea1ac282ddd2243386528cd157c02f36
3.7 GiB [## ] 736e225a7dd38a7987d0745b1b2f545ab701cfdf1f639874f5743b5bfb5cb1e1
3.7 GiB [## ] 0687246c92ec87b54e1c5fe623a77b650c02e6884e17a6f0fb4052a862d928d0
3.6 GiB [## ] 2becb5f9878b95f1b12622f50868f5855221985f05910d7cc759e6be074e6b8e
3.5 GiB [## ] 2208068c69b39c46ee9fac862da3c060c58b61adcaee1b3e6aa5d6d5dd3eba86
3.5 GiB [## ] caf87e71232cbb8a31960a26ba30b9412c15893c831ef118196c581cfd3a3779
3.4 GiB [## ] dc88cbf0ef45351bdc1f53c4396466d3e79874803719e266630ed6c3ad911d6a
3.4 GiB [## ] f05f7fb3b55b6840ebc4ada5daa28742bbae6ad4dcc35781dc811024f27a1b4e
3.4 GiB [## ] 88bd831618b36330ef5cd84b7ccbc4d5f3f55955c0b223208bc2244b27fb2d78
3.4 GiB [## ] bf80943b3389ddbeb8fb8a56af2d7fa5d09c5af076aac93f54ad921ee382c77d
3.3 GiB [## ] 83b2627e644c9ad0486e3bd966b02f014722e668d26b9d52394c974fcf2fdcf8
3.2 GiB [## ] e52e7b086dabd431b25cf309e1fe513190543e058f4e7a2d8e05b22821ded4fe
3.2 GiB [## ] 4fe583348f3ac118f34c7b93b6a187ba4e21a5a7f5b6ca1a6adbce1cc6d563a9
3.2 GiB [## ] ae6b6faca3bbd75e7ca99ccf20b55b017393bf09022efb8459293afffe06dc6e
3.1 GiB [## ] 5865379a894f8dc40703bdc1093d45fda67d5e1a742a2eebddd37e1a00f067fd
3.1 GiB [## ] cd346324b29390a589926ccab7187ae818cf5f9fcbaf8ecc95313e6cdfab86bc
3.0 GiB [## ] 914eb2b1174a662e3faebac82f6b5591a54def39a9d3a7e5ab2347ecc87a982f
2.9 GiB [## ] 24789f33332e8539b2ee72a0a489c0f4d0c6103f7f9600de660d78543ade9111
2.9 GiB [## ] 35e8da5f831b36416c9569014c58f881a0a30c00db9f3caae0d7db6a8fd3c694
2.8 GiB [## ] d5127e0298661d40a343d58759ed6298f9d2ef02d5c4f6a30bd9e07bc5423317
2.8 GiB [## ] 1b4e1951da2462ca77d94d220a58c97f64caa2b2defe4df95feed9defcee6ca7
2.8 GiB [## ] 75a4725625c095d98ecef7d68d384d7b1201ace046ef02ed499776b0ac02b61e
2.8 GiB [## ] fefbbc3e87be522b7e571c78a188aba35bd5d282cf8f41257097a621af64ff60
Total disk usage: 184.8 GiB Apparent size: 184.8 GiB Items: 85
How can I download a HuggingFace dataset via the HuggingFace CLI while keeping the original filenames?
One has to look in the snapshots folder:
/home/username/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots
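It contains the original, human-readable filenames. However, those files are symlinks pointing to the blob files, which are named by their hashes. One can replace these symlinks with the actual files (stored in blobs), which gives the files their original filenames.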
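To see which readable filename maps to which hashed blob before changing anything, the symlinks can be listed together with their targets. This is only a minimal sketch: `list_snapshot_links` is a hypothetical helper name, and it assumes the filenames contain no newlines.

```shell
#!/bin/sh
# list_snapshot_links DIR (hypothetical helper, not part of huggingface-cli):
# print "<readable path> -> <symlink target>" for every symlink under DIR.
list_snapshot_links() {
    find "$1" -type l | while IFS= read -r link; do
        # readlink prints the stored target, e.g. ../../blobs/<hash>
        printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}
```

It could be called on the snapshots folder shown above, for example: list_snapshot_links ~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots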
To replace the symlinks with the actual files on Linux, one can use u1686_grawity's script, saved here as script.sh:
#!/bin/sh
set -e
for link; do
    test -h "$link" || continue

    dir=$(dirname "$link")
    reltarget=$(readlink "$link")
    case $reltarget in
        /*) abstarget=$reltarget;;
        *)  abstarget=$dir/$reltarget;;
    esac

    rm -fv "$link"
    cp -afv "$abstarget" "$link" || {
        # on failure, restore the symlink
        rm -rfv "$link"
        ln -sfv "$reltarget" "$link"
    }
done
Run it from inside the snapshots folder with:
find . -type l -exec /path/to/script.sh {} +
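If the cache should stay untouched, an alternative to replacing the symlinks in place is to copy the snapshot somewhere else while dereferencing the symlinks. This is only a rough sketch, and it assumes there is enough free disk space for a second copy of the data (the dataset above is about 184.8 GiB). `export_snapshot` and its two arguments are hypothetical names, not part of the HuggingFace CLI.

```shell
#!/bin/sh
# export_snapshot SRC DEST (hypothetical helper):
# copy the contents of SRC into DEST, following symlinks so that
# DEST ends up with regular files under their readable names.
export_snapshot() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    # -R: recurse into directories; -L: follow every symlink encountered
    cp -RL "$src"/. "$dest"/
}
```

The blobs and symlinks in the cache are left as they were, so the copy can be deleted or recreated later without downloading again.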