I'm running into a problem saving and loading data to Azure Blob Storage with the deltalake library. Sometimes I get the following error:
DatasetError: Failed while saving data to data set CustomDeltaTableDataset(file_example).
Failed to parse parquet: Parquet error: AsyncChunkReader::get_bytes error:
Generic MicrosoftAzure error: Error after 10 retries in 2.196683949s, max_retries:10,
retry_timeout:180s, source:error sending request for url
(https://<address>/file.parquet):
error trying to connect: dns error: failed to lookup address information: Name or service not known
Here is an example of the parameters I'm using:
from deltalake import DeltaTable

datalake_vale = {
    'account_name': '<account>',
    'client_id': '<cli_id>',
    'tenant_id': '<tenant_id>',
    'client_secret': '<secret>',
    'timeout': '100000s'
}

# Load data from the delta table
dt = DeltaTable("abfs://<azure_address>", storage_options=datalake_vale)
I was looking for a parameter like max_retries but couldn't find anything relevant. Does anyone know of a solution or workaround for this problem?
Thanks in advance for your help!
You can control the retry count and timeout as shown below:
datalake_vale = {
    'account_name': account_name,
    'client_id': client_id,
    'tenant_id': tenant_id,
    'client_secret': client_secret,
    'timeout': '100000s',
    'retries': '20',
    'retry_delay': '2',
}
Here is the complete code for your reference:
from deltalake import DeltaTable
import fsspec
# Azure Blob Storage configuration
account_name = '<accountName>'
client_id = '<clientId>'
tenant_id = '<tenantId>'
client_secret = '<ClientSecret>'
container_name = '<containerName>'
# Construct storage_options dictionary with retry settings
datalake_vale = {
    'account_name': account_name,
    'client_id': client_id,
    'tenant_id': tenant_id,
    'client_secret': client_secret,
    'timeout': '100000s',
    'retries': '20',
    'retry_delay': '2',
}
# Azure Blob Storage path for Delta Table
delta_table_path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/<deltaTablePath>"
# Load DeltaTable with storage_options
dt = DeltaTable(delta_table_path, storage_options=datalake_vale)
# Example: Retrieve and print schema
print(dt.schema())
Running this prints the table schema, confirming that the Delta table loaded successfully and that you can update it.
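For the write path, one option that should work with the same storage_options is write_deltalake from the same library. This is only a minimal sketch; the DataFrame df and its columns are made up for illustration and are not part of the answer above:

import pandas as pd
from deltalake import write_deltalake

# Hypothetical data to append; replace with a DataFrame matching your table's schema.
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Append to the same table, reusing the storage_options defined above.
write_deltalake(delta_table_path, df, mode="append", storage_options=datalake_vale)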
Make sure you are using an ADLS (Data Lake Storage Gen2) account; this does not work with plain Blob Storage.
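If the intermittent DNS errors still occur after raising retries and retry_delay, a further workaround is to retry at the application level. This is a minimal sketch, not part of the deltalake API; load_with_retry, the attempt count, and the backoff are all illustrative choices:

import time
from deltalake import DeltaTable

def load_with_retry(path, storage_options, attempts=5, base_delay=2.0):
    """Retry DeltaTable construction on transient network/DNS failures."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            # Constructing a fresh DeltaTable forces a new connection (and DNS lookup).
            return DeltaTable(path, storage_options=storage_options)
        except Exception as exc:  # narrow this to deltalake's exception types if preferred
            last_exc = exc
            time.sleep(base_delay * attempt)  # simple linear backoff between attempts
    raise last_exc

dt = load_with_retry(delta_table_path, datalake_vale)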