I have some Terraform code to deploy a Databricks workspace:
resource "azurerm_databricks_workspace" "databricks" {
resource_group_name = var.resource_group_name
location = var.context.location # West Europe
name = local.dbw_name
managed_resource_group_name = local.mrg_name
sku = "premium" # Needed for private endpoint
public_network_access_enabled = false
network_security_group_rules_required = "NoAzureDatabricksRules" # https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/private-link#--step-3-provision-an-azure-databricks-workspace-and-private-endpoints
custom_parameters {
no_public_ip = true # Security constrain
virtual_network_id = local.vnet_id
public_subnet_name = local.container_subnet_name
public_subnet_network_security_group_association_id = var.subnet_configuration_for_container.network_security_group_id
private_subnet_name = local.host_subnet_name
private_subnet_network_security_group_association_id = var.subnet_configuration_for_host.network_security_group_id
storage_account_name = local.st_name
}
tags = merge(local.default_tags, { managed_databricks = true })
lifecycle {
ignore_changes = [
tags["availability"],
tags["confidentiality"],
tags["integrity"],
tags["spoke_type"],
tags["traceability"],
]
precondition {
condition = length(local.dbw_name) < 64
error_message = "The Databricks resource name must be no longer than 64 characters. Please shorten the `instance` variable."
}
}
depends_on = [
data.azapi_resource.subnet["host"],
data.azapi_resource.subnet["container"]
]
}
We also have 2 private endpoints, one for the workspace webapp and one for webauth. Both are then registered in our custom DNS so that the host and container URLs/IPs are reachable from inside our network.
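For completeness, here is a simplified sketch of those two endpoints. The endpoint subnet (`local.pe_subnet_id`), the resource names, and the `azurerm_private_dns_zone` resource are placeholders standing in for our real module wiring and custom DNS registration:

```hcl
# Placeholder for the DNS registration: in our setup the records land in a
# custom DNS, but a privatelink zone illustrates the same resolution path.
resource "azurerm_private_dns_zone" "dbw" {
  name                = "privatelink.azuredatabricks.net"
  resource_group_name = var.resource_group_name
}

# Front-end (webapp) private endpoint for the workspace UI/API.
resource "azurerm_private_endpoint" "dbw_webapp" {
  name                = "${local.dbw_name}-pe-webapp" # placeholder name
  location            = var.context.location
  resource_group_name = var.resource_group_name
  subnet_id           = local.pe_subnet_id # assumed dedicated endpoint subnet

  private_service_connection {
    name                           = "${local.dbw_name}-psc-webapp"
    private_connection_resource_id = azurerm_databricks_workspace.databricks.id
    is_manual_connection           = false
    subresource_names              = ["databricks_ui_api"]
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.dbw.id]
  }
}

# Web auth (browser authentication) private endpoint.
resource "azurerm_private_endpoint" "dbw_webauth" {
  name                = "${local.dbw_name}-pe-webauth" # placeholder name
  location            = var.context.location
  resource_group_name = var.resource_group_name
  subnet_id           = local.pe_subnet_id

  private_service_connection {
    name                           = "${local.dbw_name}-psc-webauth"
    private_connection_resource_id = azurerm_databricks_workspace.databricks.id
    is_manual_connection           = false
    subresource_names              = ["browser_authentication"]
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.dbw.id]
  }
}
```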
Deploying this code on subscription A, we had 0 problems: clusters start correctly, quickly, with no timeouts. But when deploying on subscriptions B and C, which are identical to A, we run into problems. The only difference between A/B/C is the names/IDs (same tenant ID); the policies are the same, the Terraform code is the same, and the DNS/firewall/proxy are the same.
On subscriptions B/C, clusters/jobs/dbt runs fail to launch, but not always. This is the error shown in the Databricks UI:
Failed to get instance bootstrap steps from the Databricks Control Plane.
Please check that instances have connectivity to the Databricks Control Plane.
Instance bootstrap failed command: GetRunbook Failure message: (Base64 encoded) XXXXXXXXXXXXXXX
VM extension code: ProvisioningState/succeeded instanceId:
InstanceId(aca79a0fb49e4c808700af118638e8ac)
workerEnv: workerenv-2477805373171457
Additional details (may be truncated): [Bootstrap Event] Command DownloadBootstrapScript finished.
Storage Account: arprodwesteua4.blob.core.windows.net [SUCCEEDED]. Seconds Elapsed: 4.69204 2024/07/10 09:09:54
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetToken finished. [SUCCEEDED]. Seconds Elapsed: 0.0390388965607 2024/07/10 09:09:54
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0142250061035 2024/07/10 09:10:11
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0148358345032 2024/07/10 09:10:30
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0136380195618 2024/07/10 09:10:54
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014701128006 2024/07/10 09:11:25
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0169570446014 2024/07/10 09:12:12
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0156190395355 2024/07/10 09:13:31
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0194149017334 2024/07/10 09:15:55
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014839887619 2024/07/10 09:20:26
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0163550376892 2024/07/10 09:20:41
INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetRunbook finished. [FAILED] . Seconds Elapsed: 646.803928137 2024/07/10 09:20:41
INFO vm_bootstrap.py:240: [Bootstrap Event] {FAILED_COMMAND:GetRunbook} 2024/07/10 09:20:41
INFO vm_bootstrap.py:242: [Bootstrap Event] {FAILED_MESSAGE:(Base64 encoded) XXXXXXXXXXX } 2024/07/10 09:20:41
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0184330940247
When I start a SparkSession from my machine to reach the cluster, I get the following:
24/07/11 08:32:25 WARN HTTPClient: Excluding proxy hosts for HTTP client based on env var no_proxy=localhost,10.0.0.0/8,storageA.blob.core.windows.net,storageB.blob.core.windows.net,storageC.blob.core.windows.net,storageD.blob.core.windows.net,storageE.blob.core.windows.net,storageF.blob.core.windows.net,storageG.blob.core.windows.net,.dev.azuresynapse.net,.azuresynapse.net,.table.core.windows.net,.queue.core.windows.net,.file.core.windows.net,.web.core.windows.net,.dfs.core.windows.net,.documents.azure.com,.batch.azure.com,.service.batch.azure.com,.vault.azure.net,.vaultcore.azure.net,.managedhsm.azure.net,.azmk8s.io,.search.windows.net,.azurecr.io,.azconfig.io,.servicebus.windows.net,.azure-devices.net,.servicebus.windows.net,.azure-devices-provisioning.net,.eventgrid.azure.net,.azurewebsites.net,.scm.azurewebsites.net,.api.azureml.ms,.notebooks.azure.net,.instances.azureml.ms,.aznbcontent.net,.inference.ml.azure.com,.cognitiveservices.azure.com,.afs.azure.net,.datafactory.azure.net,.adf.azure.com,.purview.azure.com,.azure-api.net,.developer.azure-api.net,.analysis.windows.net,.azuredatabricks.net,.dev.azure.com,.azurefd.net,.vsblob.vsassets.io,otr.dtc3.cf.saint-gobain.net,.openai.azure.com
24/07/11 08:32:26 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state TERMINATED, waiting for it to start running...
24/07/11 08:32:37 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:22 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:32 ERROR SparkClientManager: Fail to get the SparkClient
java.util.concurrent.ExecutionException: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"
The cluster ID you specified does not correspond to any existing cluster.
Cluster ID: The ID of the cluster on which you want to run your code
- This should look like 0123-456789-abcd012
- Get current value: spark.conf.get("spark.databricks.service.clusterId")
- Set via conf: spark.conf.set("spark.databricks.service.clusterId", <your cluster ID>)
- Set via environment variable: export DATABRICKS_CLUSTER_ID=<your cluster ID>
at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2193)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:3932)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
at com.databricks.service.SparkClientManager.liftedTree1$1(SparkClient.scala:377)
at com.databricks.service.SparkClientManager.getForSession(SparkClient.scala:376)
at com.databricks.service.SparkClientManager.getForSession$(SparkClient.scala:353)
at com.databricks.service.SparkClientManager$.getForSession(SparkClient.scala:401)
at com.databricks.service.SparkClientManager.getForCurrentSession(SparkClient.scala:351)
at com.databricks.service.SparkClientManager.getForCurrentSession$(SparkClient.scala:351)
at com.databricks.service.SparkClientManager$.getForCurrentSession(SparkClient.scala:401)
at com.databricks.service.SparkClient$.getServerHadoopConf(SparkClient.scala:297)
at com.databricks.spark.util.SparkClientContext$.getServerHadoopConf(SparkClientContext.scala:281)
at org.apache.spark.SparkContext.$anonfun$hadoopConfiguration$1(SparkContext.scala:407)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.SparkContext.hadoopConfiguration(SparkContext.scala:398)
at com.databricks.sql.DatabricksEdge.catalog(DatabricksEdge.scala:198)
at com.databricks.sql.DatabricksEdge.catalog$(DatabricksEdge.scala:197)
at org.apache.spark.sql.internal.SessionStateBuilder.catalog$lzycompute(SessionState.scala:179)
at org.apache.spark.sql.internal.SessionStateBuilder.catalog(SessionState.scala:179)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:190)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:190)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:193)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:192)
at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:208)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:208)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$7(BaseSessionStateBuilder.scala:427)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:106)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:106)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:171)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:352)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:393)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:821)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:393)
at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:389)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:389)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:165)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:165)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:155)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
at org.apache.spark.sql.SparkSession.$anonfun$withActiveAndFrameProfiler$1(SparkSession.scala:1080)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
at org.apache.spark.sql.SparkSession.withActiveAndFrameProfiler(SparkSession.scala:1080)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:811)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:835)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:578)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"
Sometimes we get an error like this instead:
24/07/11 08:13:28 WARN SparkServiceRPCClient: Fatal connection error for RPC 1257ff66-7657-46de-8c8a-28bad097c6b9
Traceback (most recent call last):
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 67, in <module>
main()
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 31, in main
metric = count_check.run(
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 53, in run
current_df = reduce(
File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 56, in <genexpr>
spark.table(table_info.table_name.format(settings.deploy_env))
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/instrumentation_utils.py", line 48, in wrapper
res = func(*args, **kwargs)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/sql/session.py", line 1423, in table
return DataFrame(self._jsparkSession.table(tableName), self)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/errors/exceptions.py", line 228, in deco
return f(*a, **kw)
File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o26.table.
: com.databricks.service.SparkServiceConnectionException: Request failed with HTTP 404
Client information:
Shard address: "https://adb-xxxxxxxxxxxx.xx.azuredatabricks.net"
Cluster ID: "xxxx-xxxxxx-xxxxxxxx"
Port: 15001
Token ID: "xxxxxxxxxxxxxxx"
Org ID: xxxxxxxxxxxx
Response:
Tunnel f155a135180b448bba783398a66a1878.workerenv-2477805373171457.mux.ngrok-dataplane.wildcard not found
at com.databricks.service.SparkServiceRPCClient.handleResponse(SparkServiceRPCClient.scala:134)
at com.databricks.service.SparkServiceRPCClient.doPost(SparkServiceRPCClient.scala:112)
We don't know why the errors are so random. We have telnet-tested the firewall/proxy/DNS and everything works. All ports are opened the same way for workspaces A/B/C, and all of them run with `no_public_ip`. I don't understand why the Control Plane cannot be reached. And even when it does work, we sometimes get timeouts reading/writing on DBFS + ABFS (with correctly configured mount points).
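To illustrate what "opened the same way" means, here is a minimal sketch of the kind of allow rule involved, assuming an Azure Firewall policy in front of the egress. The rule names, `var.firewall_policy_id`, and the source CIDR variables are hypothetical, and the FQDN list is illustrative; the authoritative list is in the Azure Databricks egress/UDR documentation for your region:

```hcl
# Hedged sketch: outbound 443 allow rules for the Databricks control plane.
# All names and variables below are illustrative placeholders.
resource "azurerm_firewall_policy_rule_collection_group" "dbw_egress" {
  name               = "databricks-control-plane" # hypothetical name
  firewall_policy_id = var.firewall_policy_id     # assumed input
  priority           = 500

  application_rule_collection {
    name     = "databricks-fqdns"
    priority = 500
    action   = "Allow"

    rule {
      name             = "control-plane-https"
      source_addresses = [var.host_subnet_cidr, var.container_subnet_cidr] # assumed inputs

      destination_fqdns = [
        "*.azuredatabricks.net",   # webapp and SCC relay endpoints
        "*.blob.core.windows.net", # artifact/log storage, e.g. arprodwesteua4
        "*.servicebus.windows.net" # Event Hubs endpoints used by Databricks
      ]

      protocols {
        type = "Https"
        port = 443
      }
    }
  }
}
```

Everything control-plane-related here is outbound 443, which is why the successful telnet tests make the roughly 647-second GetRunbook timeout so confusing.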
Update 1: when the cluster is running and tries to scale up, it sometimes fails to provision 1 or 2 of the nodes:
```
{
  "current_num_workers": 17,
  "target_num_workers": 19,
  "reason": {
    "code": "CONTROL_PLANE_REQUEST_FAILURE",
    "type": "CLOUD_FAILURE",
    "parameters": {
      "databricks_error_message": "Failed to get instance bootstrap steps from the Databricks Control Plane. Please check that instances have connectivity to the Databricks Control Plane.
Instance bootstrap failed command: GetRunbook
Failure message: (Base64 encoded) xxxxxxxxxxxxxx
VM extension code: ProvisioningState/succeeded
instanceId: InstanceId(34eef4f370e44939b3dbd82a0ddcc7e4)
workerEnv: workerenv-2477805373171457
Additional details (may be truncated):
[Bootstrap Event] Command DownloadBootstrapScript finished. Storage Account: arprodwesteua16.blob.core.windows.net [SUCCEEDED]. Seconds Elapsed: 1.79971
2024/07/11 09:12:54 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetToken finished. [SUCCEEDED]. Seconds Elapsed: 0.265645980835
2024/07/11 09:12:54 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0204498767853
2024/07/11 09:13:11 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0198380947113
2024/07/11 09:13:30 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0234580039978
2024/07/11 09:13:53 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.019935131073
2024/07/11 09:14:25 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0190289020538
2024/07/11 09:15:12 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0213069915771
2024/07/11 09:16:31 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0221757888794
2024/07/11 09:18:54 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0190780162811
2024/07/11 09:23:26 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0225570201874
2024/07/11 09:23:41 INFO vm_bootstrap.py:1242: [Bootstrap Event] Command GetRunbook finished. [FAILED] . Seconds Elapsed: 646.955144167
2024/07/11 09:23:41 INFO vm_bootstrap.py:244: [Bootstrap Event] {FAILED_COMMAND:GetRunbook}
2024/07/11 09:23:41 INFO vm_bootstrap.py:246: [Bootstrap Event] {FAILED_MESSAGE:(Base64 encoded) xxxxxxxxxxxxxxxxx }
2024/07/11 09:23:41 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0183608531952
",
"instance_id": "34eef4f370e44939b3dbd82a0ddcc7e4"
}
}
}
Did you ever find a solution to this problem? I have exactly the same issue. All firewall ports are open, and we have a similar setup in other environments, but this one doesn't seem to work. The firewall logs don't show any dropped packets or anything like that, so we're out of ideas.