Azure Databricks "Failed to get instance bootstrap steps from the Databricks Control Plane"

Problem description

I have Terraform code that deploys a Databricks workspace.

resource "azurerm_databricks_workspace" "databricks" {
  resource_group_name = var.resource_group_name
  location            = var.context.location # West Europe

  name                        = local.dbw_name
  managed_resource_group_name = local.mrg_name

  sku = "premium" # Needed for private endpoint

  public_network_access_enabled         = false
  network_security_group_rules_required = "NoAzureDatabricksRules" # https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/private-link#--step-3-provision-an-azure-databricks-workspace-and-private-endpoints

  custom_parameters {
    no_public_ip                                         = true # Security constraint
    virtual_network_id                                   = local.vnet_id
    public_subnet_name                                   = local.container_subnet_name
    public_subnet_network_security_group_association_id  = var.subnet_configuration_for_container.network_security_group_id
    private_subnet_name                                  = local.host_subnet_name
    private_subnet_network_security_group_association_id = var.subnet_configuration_for_host.network_security_group_id
    storage_account_name                                 = local.st_name
  }

  tags = merge(local.default_tags, { managed_databricks = true })

  lifecycle {
    ignore_changes = [
      tags["availability"],
      tags["confidentiality"],
      tags["integrity"],
      tags["spoke_type"],
      tags["traceability"],
    ]
    precondition {
      condition     = length(local.dbw_name) < 64
      error_message = "The Databricks resource name must be no longer than 64 characters. Please shorten the `instance` variable."
    }
  }

  depends_on = [
    data.azapi_resource.subnet["host"],
    data.azapi_resource.subnet["container"]
  ]
}

We also have two private endpoints, one for the workspace web app (dbw) and one for web auth (dbw). Both are then registered in our custom DNS so that the host and container URLs/IPs are reachable from inside our network.
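For context, the two endpoints are declared roughly like this (a minimal sketch; the resource names, `local` references and subnet ID are placeholders, not our exact code):

# Hypothetical sketch of the two private endpoints described above.
# Names, subnet IDs and locals are illustrative placeholders.
resource "azurerm_private_endpoint" "dbw_ui_api" {
  name                = "pep-dbw-ui-api"
  location            = var.context.location
  resource_group_name = var.resource_group_name
  subnet_id           = local.pep_subnet_id # dedicated private-endpoint subnet

  private_service_connection {
    name                           = "psc-dbw-ui-api"
    private_connection_resource_id = azurerm_databricks_workspace.databricks.id
    subresource_names              = ["databricks_ui_api"] # workspace UI + REST API
    is_manual_connection           = false
  }

  # The resulting A records are created in our custom DNS (not a
  # private_dns_zone_group), which is why none is declared here.
}

# Second endpoint for browser authentication (web auth) callbacks.
resource "azurerm_private_endpoint" "dbw_browser_auth" {
  name                = "pep-dbw-browser-auth"
  location            = var.context.location
  resource_group_name = var.resource_group_name
  subnet_id           = local.pep_subnet_id

  private_service_connection {
    name                           = "psc-dbw-browser-auth"
    private_connection_resource_id = azurerm_databricks_workspace.databricks.id
    subresource_names              = ["browser_authentication"]
    is_manual_connection           = false
  }
}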

Deploying this code on subscription A, we had zero problems: clusters start correctly, quickly, and without timeouts. However, deploying on subscriptions B and C, which are identical to A, we run into problems. The only differences between A/B/C are the names/IDs (same tenant ID); the policies, the Terraform code, and the DNS/firewall/proxy are all the same.

On subscriptions B/C, clusters/jobs/dbt runs fail to start, but not always. This is the error shown in the Databricks UI:

Failed to get instance bootstrap steps from the Databricks Control Plane.
Please check that instances have connectivity to the Databricks Control Plane.
Instance bootstrap failed command: GetRunbook Failure message: (Base64 encoded) XXXXXXXXXXXXXXX
VM extension code: ProvisioningState/succeeded instanceId:
InstanceId(aca79a0fb49e4c808700af118638e8ac)
workerEnv: workerenv-2477805373171457
Additional details (may be truncated): [Bootstrap Event] Command DownloadBootstrapScript finished.
Storage Account: arprodwesteua4.blob.core.windows.net [SUCCEEDED]. Seconds Elapsed: 4.69204 2024/07/10 09:09:54
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetToken finished. [SUCCEEDED]. Seconds Elapsed: 0.0390388965607 2024/07/10 09:09:54
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0142250061035 2024/07/10 09:10:11
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0148358345032 2024/07/10 09:10:30
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0136380195618 2024/07/10 09:10:54
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014701128006 2024/07/10 09:11:25
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0169570446014 2024/07/10 09:12:12
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0156190395355 2024/07/10 09:13:31
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0194149017334 2024/07/10 09:15:55
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.014839887619 2024/07/10 09:20:26
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0163550376892 2024/07/10 09:20:41
INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetRunbook finished. [FAILED] . Seconds Elapsed: 646.803928137 2024/07/10 09:20:41
INFO vm_bootstrap.py:240: [Bootstrap Event] {FAILED_COMMAND:GetRunbook} 2024/07/10 09:20:41 
INFO vm_bootstrap.py:242: [Bootstrap Event] {FAILED_MESSAGE:(Base64 encoded) XXXXXXXXXXX } 2024/07/10 09:20:41
INFO vm_bootstrap.py:1224: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0184330940247

When starting a SparkSession from my machine to reach the cluster, I get the following:

24/07/11 08:32:25 WARN HTTPClient: Excluding proxy hosts for HTTP client based on env var no_proxy=localhost,10.0.0.0/8,storageA.blob.core.windows.net,storageB.blob.core.windows.net,storageC.blob.core.windows.net,storageD.blob.core.windows.net,storageE.blob.core.windows.net,storageF.blob.core.windows.net,storageG.blob.core.windows.net,.dev.azuresynapse.net,.azuresynapse.net,.table.core.windows.net,.queue.core.windows.net,.file.core.windows.net,.web.core.windows.net,.dfs.core.windows.net,.documents.azure.com,.batch.azure.com,.service.batch.azure.com,.vault.azure.net,.vaultcore.azure.net,.managedhsm.azure.net,.azmk8s.io,.search.windows.net,.azurecr.io,.azconfig.io,.servicebus.windows.net,.azure-devices.net,.servicebus.windows.net,.azure-devices-provisioning.net,.eventgrid.azure.net,.azurewebsites.net,.scm.azurewebsites.net,.api.azureml.ms,.notebooks.azure.net,.instances.azureml.ms,.aznbcontent.net,.inference.ml.azure.com,.cognitiveservices.azure.com,.afs.azure.net,.datafactory.azure.net,.adf.azure.com,.purview.azure.com,.azure-api.net,.developer.azure-api.net,.analysis.windows.net,.azuredatabricks.net,.dev.azure.com,.azurefd.net,.vsblob.vsassets.io,otr.dtc3.cf.saint-gobain.net,.openai.azure.com
24/07/11 08:32:26 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state TERMINATED, waiting for it to start running...
24/07/11 08:32:37 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:22 WARN SparkServiceRPCClient: Cluster xxxx-xxxxxx-xxxxxxxx in state PENDING, waiting for it to start running...
24/07/11 08:46:32 ERROR SparkClientManager: Fail to get the SparkClient
java.util.concurrent.ExecutionException: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"

The cluster ID you specified does not correspond to any existing cluster.
Cluster ID: The ID of the cluster on which you want to run your code
  - This should look like 0123-456789-abcd012
  - Get current value: spark.conf.get("spark.databricks.service.clusterId")
  - Set via conf: spark.conf.set("spark.databricks.service.clusterId", <your cluster ID>)
  - Set via environment variable: export DATABRICKS_CLUSTER_ID=<your cluster ID>
      
    at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
    at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
    at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
    at org.sparkproject.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
    at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
    at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2193)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:3932)
    at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
    at com.databricks.service.SparkClientManager.liftedTree1$1(SparkClient.scala:377)
    at com.databricks.service.SparkClientManager.getForSession(SparkClient.scala:376)
    at com.databricks.service.SparkClientManager.getForSession$(SparkClient.scala:353)
    at com.databricks.service.SparkClientManager$.getForSession(SparkClient.scala:401)
    at com.databricks.service.SparkClientManager.getForCurrentSession(SparkClient.scala:351)
    at com.databricks.service.SparkClientManager.getForCurrentSession$(SparkClient.scala:351)
    at com.databricks.service.SparkClientManager$.getForCurrentSession(SparkClient.scala:401)
    at com.databricks.service.SparkClient$.getServerHadoopConf(SparkClient.scala:297)
    at com.databricks.spark.util.SparkClientContext$.getServerHadoopConf(SparkClientContext.scala:281)
    at org.apache.spark.SparkContext.$anonfun$hadoopConfiguration$1(SparkContext.scala:407)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.SparkContext.hadoopConfiguration(SparkContext.scala:398)
    at com.databricks.sql.DatabricksEdge.catalog(DatabricksEdge.scala:198)
    at com.databricks.sql.DatabricksEdge.catalog$(DatabricksEdge.scala:197)
    at org.apache.spark.sql.internal.SessionStateBuilder.catalog$lzycompute(SessionState.scala:179)
    at org.apache.spark.sql.internal.SessionStateBuilder.catalog(SessionState.scala:179)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:190)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:190)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:193)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:192)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:208)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:208)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$7(BaseSessionStateBuilder.scala:427)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:106)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:106)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:171)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:352)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:393)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:821)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:393)
    at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:389)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:389)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:165)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:165)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:155)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:100)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
    at org.apache.spark.sql.SparkSession.$anonfun$withActiveAndFrameProfiler$1(SparkSession.scala:1080)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:24)
    at org.apache.spark.sql.SparkSession.withActiveAndFrameProfiler(SparkSession.scala:1080)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
    at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:811)
    at org.apache.spark.sql.SparkSession.table(SparkSession.scala:835)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
    at java.base/java.lang.reflect.Method.invoke(Method.java:578)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:306)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
    at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: com.databricks.service.SparkServiceConnectionException: Invalid cluster ID: "xxxx-xxxxxx-xxxxxxxx"

Sometimes we get an error like this instead:

24/07/11 08:13:28 WARN SparkServiceRPCClient: Fatal connection error for RPC 1257ff66-7657-46de-8c8a-28bad097c6b9
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 67, in <module>
    main()
  File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/main.py", line 31, in main
    metric = count_check.run(
  File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 53, in run
    current_df = reduce(
  File "/mnt/azureml/cr/j/1da76efc59a44a94b488740ba8b08bba/exe/wd/core/count_check.py", line 56, in <genexpr>
    spark.table(table_info.table_name.format(settings.deploy_env))
  File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/sql/session.py", line 1423, in table
    return DataFrame(self._jsparkSession.table(tableName), self)
  File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/azureml-envs/python3.10/lib/python3.10/site-packages/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/azureml-envs/python3.10/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o26.table.
: com.databricks.service.SparkServiceConnectionException: Request failed with HTTP 404
Client information:
Shard address: "https://adb-xxxxxxxxxxxx.xx.azuredatabricks.net"
Cluster ID: "xxxx-xxxxxx-xxxxxxxx"
Port: 15001
Token ID: "xxxxxxxxxxxxxxx"
Org ID: xxxxxxxxxxxx
     
Response:
Tunnel f155a135180b448bba783398a66a1878.workerenv-2477805373171457.mux.ngrok-dataplane.wildcard not found
    at com.databricks.service.SparkServiceRPCClient.handleResponse(SparkServiceRPCClient.scala:134)
    at com.databricks.service.SparkServiceRPCClient.doPost(SparkServiceRPCClient.scala:112)

We don't know why the errors are so random. We telnet-tested the firewall/proxy/DNS and everything passes. All ports are opened the same way for workspaces A/B/C, and all of them run with `no_public_ip`. I don't understand why the Control Plane isn't being reached. Even when it does work, we sometimes hit timeouts reading/writing on DBFS + ABFS (mount points are configured correctly).
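Since `no_public_ip = true` means secure cluster connectivity, every bootstrap call (including the failing GetRunbook above) is an outbound HTTPS request from the worker VMs to the control plane through our egress path. For illustration, the allow-rule looks roughly like this on all three subscriptions (a sketch assuming an Azure Firewall; the firewall name, source locals and FQDNs are placeholders, and the real region-specific SCC relay/webapp addresses come from Microsoft's Azure Databricks UDR documentation):

# Hypothetical egress rule for Databricks control-plane traffic.
# FQDNs below are placeholders; use the region-specific list from
# the Azure Databricks user-defined-route (UDR) documentation.
resource "azurerm_firewall_application_rule_collection" "databricks_control_plane" {
  name                = "databricks-control-plane"
  azure_firewall_name = azurerm_firewall.hub.name # assumed existing hub firewall
  resource_group_name = var.resource_group_name
  priority            = 200
  action              = "Allow"

  rule {
    name             = "databricks-scc-and-webapp"
    source_addresses = [local.host_subnet_cidr, local.container_subnet_cidr] # assumed locals
    target_fqdns = [
      "tunnel.westeurope.azuredatabricks.net", # SCC relay (verify for your region)
      "*.azuredatabricks.net",                 # webapp / control plane
    ]

    protocol {
      port = 443
      type = "Https"
    }
  }
}

These rules are identical on A/B/C, which is what makes the asymmetry so hard to explain.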

Update 1: While a cluster is running, scaling up sometimes fails to provision 1 or 2 nodes:

{
  "current_num_workers": 17,
  "target_num_workers": 19,
  "reason": {
    "code": "CONTROL_PLANE_REQUEST_FAILURE",
    "type": "CLOUD_FAILURE",
    "parameters": {
      "databricks_error_message": "Failed to get instance bootstrap steps from the Databricks Control Plane. Please check that instances have connectivity to the Databricks Control Plane. 
Instance bootstrap failed command: GetRunbook
Failure message: (Base64 encoded) xxxxxxxxxxxxxx 
VM extension code: ProvisioningState/succeeded
instanceId: InstanceId(34eef4f370e44939b3dbd82a0ddcc7e4)
workerEnv: workerenv-2477805373171457
Additional details (may be truncated):

    [Bootstrap Event] Command DownloadBootstrapScript finished. Storage Account: arprodwesteua16.blob.core.windows.net [SUCCEEDED]. Seconds Elapsed: 1.79971
    2024/07/11 09:12:54 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetToken finished. [SUCCEEDED]. Seconds Elapsed: 0.265645980835
    2024/07/11 09:12:54 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0204498767853
    2024/07/11 09:13:11 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0198380947113
    2024/07/11 09:13:30 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0234580039978
    2024/07/11 09:13:53 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.019935131073
    2024/07/11 09:14:25 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0190289020538
    2024/07/11 09:15:12 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0213069915771
    2024/07/11 09:16:31 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0221757888794
    2024/07/11 09:18:54 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0190780162811
    2024/07/11 09:23:26 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0225570201874
    2024/07/11 09:23:41 INFO vm_bootstrap.py:1242: [Bootstrap Event] Command GetRunbook finished. [FAILED] . Seconds Elapsed: 646.955144167
    2024/07/11 09:23:41 INFO vm_bootstrap.py:244: [Bootstrap Event] {FAILED_COMMAND:GetRunbook}
    2024/07/11 09:23:41 INFO vm_bootstrap.py:246: [Bootstrap Event] {FAILED_MESSAGE:(Base64 encoded) xxxxxxxxxxxxxxxxx }
    2024/07/11 09:23:41 INFO vm_bootstrap.py:1233: [Bootstrap Event] Command GetInstanceId finished. [SUCCEEDED]. Seconds Elapsed: 0.0183608531952
    ",
      "instance_id": "34eef4f370e44939b3dbd82a0ddcc7e4"
    }
  }
}
Tags: databricks, azure-databricks
1 Answer

Did you ever find a solution to this? I have exactly the same problem. All firewall ports are open, and we have a similar setup in other environments, but this one doesn't seem to work. The firewall logs don't show any dropped packets or the like, so I'm out of ideas.
