I'm trying to use Python to download a large folder containing 50,000 images from my Google Drive to a local server. The code below hits a limit error. Is there another solution?
import gdown
url = 'https://drive.google.com/drive/folders/135hTTURfjn43fo4f?usp=sharing' # I'm showing a fake token
gdown.download_folder(url)
Failed to retrieve folder contents:
The gdrive folder with url: https://drive.google.com/drive/folders/135hTTURfjn43fo4f?usp=sharing has at least 50 files, gdrive can't download more than this limit, if you are ok with this, please run again with the --remaining-ok flag.
As kite mentioned in the comments, use it with the remaining_ok flag:
gdown.download_folder(url, remaining_ok=True)
This isn't mentioned at https://pypi.org/project/gdown/, so it may cause some confusion. Apart from the warning message and this GitHub code, no other reference for remaining_ok is available. It seems gdown is strictly limited to 50 files, and I haven't found a way to bypass that limit. In case you want an option other than gdown, see the code below.
import io
import os
import os.path
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google.oauth2 import service_account
credential_json = {
### Create a service account and paste its JSON key content here ###
### https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account
### Remember to share the target Drive folder with the service account's client_email,
### otherwise the folder will not be visible to it.
### credentials.json looks like this:
"type": "service_account",
"project_id": "*********",
"private_key_id": "*********",
"private_key": "-----BEGIN PRIVATE KEY-----\n*********\n-----END PRIVATE KEY-----\n",
"client_email": "service-account@*********.iam.gserviceaccount.com",
"client_id": "*********",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account%40*********.iam.gserviceaccount.com"
}
credentials = service_account.Credentials.from_service_account_info(credential_json)
drive_service = build('drive', 'v3', credentials=credentials)
folderId = '### Google Drive Folder ID ###'
outputFolder = 'output'
# Create folder if not existing
if not os.path.isdir(outputFolder):
    os.mkdir(outputFolder)
items = []
pageToken = ""
while pageToken is not None:
    response = drive_service.files().list(q="'" + folderId + "' in parents", pageSize=1000, pageToken=pageToken,
                                          fields="nextPageToken, files(id, name)").execute()
    items.extend(response.get('files', []))
    pageToken = response.get('nextPageToken')
for file in items:
    file_id = file['id']
    file_name = file['name']
    request = drive_service.files().get_media(fileId=file_id)
    ### Saves all files under outputFolder
    fh = io.FileIO(outputFolder + '/' + file_name, 'wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
    print(f'{file_name} downloaded completely.')
The download limit is set in ../gdown/download_folder.py
This is the workaround I used to download the URLs with gdown:
import re
import os
urls = <copied_urls>
url_list = urls.split(', ')
pat = re.compile(r'https://drive.google.com/file/d/(.*)/view\?usp=sharing')
for url in url_list:
    g = re.match(pat, url)
    id = g.group(1)
    down_url = f'https://drive.google.com/uc?id={id}'
    os.system(f'gdown {down_url}')
Note: this solution isn't ideal for 50,000 images, since the copied URL string would be too large. If your string is huge, copy it into a file and process that file instead of using a variable. In my case I had to copy 75 large files.
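Following the note above, the loop can read links from a file instead of a pasted string. A minimal sketch (the filename urls.txt and the helper drive_download_urls are my own, hypothetical names; the regex is the same one used above):

```python
import re

# Extract the file id from a Google Drive sharing link.
PAT = re.compile(r'https://drive\.google\.com/file/d/(.*)/view\?usp=sharing')

def drive_download_urls(lines):
    """Yield a direct-download URL for every sharing link in `lines`."""
    for line in lines:
        m = PAT.match(line.strip())
        if m:  # silently skip lines that are not sharing links
            yield f'https://drive.google.com/uc?id={m.group(1)}'

# Usage (hypothetical urls.txt holds one sharing URL per line):
# with open('urls.txt') as f:
#     for url in drive_download_urls(f):
#         os.system(f'gdown {url}')
```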
I was trying to download CORDv0 from Google Drive via the CLI, and there was no other good way to do it in one line. The best approach was to save the folder to disk as a zip archive and then download it as a single file.
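That "zip first" step can be sketched in plain Python with shutil (the paths below are placeholders; on Colab you would mount Drive first so the folder appears as a local path):

```python
import shutil

def archive_folder(src_dir, archive_name):
    """Pack src_dir into <archive_name>.zip so the whole folder moves
    as one large file instead of thousands of small ones."""
    return shutil.make_archive(archive_name, 'zip', src_dir)

# Example with placeholder paths:
# archive_folder('/content/drive/MyDrive/images', 'images')
```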
In some cases, the idea of changing the download limit helps. In Colab I use:
!pip uninstall gdown --yes
!cd .. && git clone https://github.com/wkentaro/gdown
with open('../gdown/gdown/download_folder.py', 'r') as f:
    code = f.read().replace('MAX_NUMBER_FILES = 50', 'MAX_NUMBER_FILES = 10000')
with open('../gdown/gdown/download_folder.py', 'w') as f:
    f.write(code)
!cd ../gdown && pip install -e . --no-cache-dir
!pip show gdown
But keep gdown's bugs in mind; as mentioned earlier, the gdown lib is not the best option.
!pip uninstall --yes gdown # After running this line, restart Colab runtime.
!pip install gdown -U --no-cache-dir
import gdown
url = r'https://drive.google.com/drive/folders/1sWD6urkwyZo8ZyZBJoJw40eKK0jDNEni'
gdown.download_folder(url)