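I'm using Metaflow to orchestrate the training pipeline for a machine learning model. The goal is to combine Metaflow with Databricks MLflow for ML monitoring. The Metaflow pipeline, its output, and the Metaflow validation are pasted below. MLFLOW_TRACKING_URI is set to "databricks". The final error is: RuntimeError: Failed to connect to MLflow server databricks.
What is missing from my configuration? I'm using a Databricks cluster with runtime version 16.0 ML (includes Apache Spark 3.5.0, Scala 2.12). What is the best way to integrate Databricks MLflow with Metaflow?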
    @step
    def start(self):
        """Start and prepare the Training pipeline."""
        import mlflow

        self.mlflow_tracking_uri = os.getenv("MLFLOW_TRACKING_URI")
        logging.info("MLFLOW_TRACKING_URI: %s", self.mlflow_tracking_uri)
        mlflow.set_tracking_uri(self.mlflow_tracking_uri)

        self.mode = "production" if current.is_production else "development"
        logging.info("Running flow in %s mode.", self.mode)
        logging.info("The metaflow id is %s ", current.run_id)

        self.data = self.load_dataset()

        try:
            # Let's start a new MLFlow run to track everything that happens during the
            # execution of this flow. We want to set the name of the MLFlow
            # experiment to the Metaflow run identifier so we can easily
            # recognize which experiment corresponds with each run.
            run = mlflow.start_run(run_name="current.run_id")
            self.mlflow_run_id = run.info.run_id
        except Exception as e:
            message = f"Failed to connect to MLflow server {self.mlflow_tracking_uri}."
            raise RuntimeError(message) from e

        # This is the configuration we'll use to train the model. We want to set it up
        # at this point so we can reuse it later throughout the flow.
        self.training_parameters = {
            "epochs": TRAINING_EPOCHS,
            "batch_size": TRAINING_BATCH_SIZE,
        }

        # Now that everything is set up, we want to run a cross-validation process
        # to evaluate the model and train a final model on the entire dataset. Since
        # these two steps are independent, we can run them in parallel.
        self.next(self.cross_validation, self.transform)
%sh
python3 ml.school/pipelines/training.py --environment=pypi run
Metaflow 2.12.39 executing Training for user:test
Project: penguins, Branch: test
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
2024-12-17 09:23:59.113 Bootstrapping virtual environment(s) ...
2024-12-17 09:23:59.691 Virtual environment(s) bootstrapped!
Including file ml.school/data/penguins.csv of size 13KB
2024-12-17 09:24:03.265 Workflow starting (run-id 1734427440104542):
2024-12-17 09:24:13.977 [1734427440104542/start/1 (pid 10535)] Task is starting.
2024-12-17 09:24:19.944 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:19,944 [INFO] MLFLOW_TRACKING_URI: databricks
2024-12-17 09:24:19.944 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:19,944 [INFO] Running flow in development mode.
2024-12-17 09:24:20.068 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:19,944 [INFO] The metaflow id is 1734427440104542
2024-12-17 09:24:20.068 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:20,068 [INFO] Loaded dataset with 344 samples
2024-12-17 09:24:23.140 [1734427440104542/start/1 (pid 10535)] <flow Training step start> failed:
2024-12-17 09:24:29.517 [1734427440104542/start/1 (pid 10535)] Internal error
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] Traceback (most recent call last):
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] File "/Workspace/Users/test/ML-End-to-End/ml.school/pipelines/training.py", line 97, in start
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] run = mlflow.start_run(run_name="current.run_id")
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/tracking/fluent.py", line 418, in start_run
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] active_run_obj = client.create_run(
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/tracking/client.py", line 393, in create_run
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py", line 168, in create_run
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] return self.store.create_run(
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/store/tracking/rest_store.py", line 209, in create_run
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] response_proto = self._call_endpoint(CreateRun, req_body)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/store/tracking/rest_store.py", line 82, in _call_endpoint
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/utils/rest_utils.py", line 370, in call_endpoint
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] response = verify_rest_response(response, endpoint)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/utils/rest_utils.py", line 240, in verify_rest_response
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] raise RestException(json.loads(response.text))
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: No experiment was found. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("experiment_name") at the start of your program.
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)]
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] The above exception was the direct cause of the following exception:
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)]
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] Traceback (most recent call last):
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/cli.py", line 554, in main
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] start(auto_envvar_prefix="METAFLOW", obj=state)
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 829, in __call__
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] return self.main(*args, **kwargs)
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 782, in main
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] rv = self.invoke(ctx)
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/cli_components/utils.py", line 69, in invoke
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 1066, in invoke
2024-12-17 09:24:29.528 [1734427440104542/start/1 (pid 10535)] return ctx.invoke(self.callback, ctx.params)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 610, in invoke
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] return callback(*args, **kwargs)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/decorators.py", line 21, in new_func
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] return f(get_current_context(), *args, **kwargs)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/cli_components/step_cmd.py", line 178, in step
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] task.run_step(
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/task.py", line 653, in run_step
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] self._exec_step_function(step_func)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/task.py", line 62, in _exec_step_function
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] step_function()
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/Workspace/Users/test/ML-End-to-End/ml.school/pipelines/training.py", line 101, in start
2024-12-17 09:24:29.897 [1734427440104542/start/1 (pid 10535)] raise RuntimeError(message) from e
2024-12-17 09:24:29.897 [1734427440104542/start/1 (pid 10535)] RuntimeError: Failed to connect to MLflow server databricks.
2024-12-17 09:24:29.897 [1734427440104542/start/1 (pid 10535)]
2024-12-17 09:24:30.134 [1734427440104542/start/1 (pid 10535)] Task failed.
2024-12-17 09:24:30.372 Workflow failed.
2024-12-17 09:24:30.372 Terminating 0 active tasks...
2024-12-17 09:24:30.372 Flushing logs...
Step failure:
Step start (task-id 1) failed.
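It is probably exactly what the error log says: the connection works, but no experiment is set. Try adding this call, with your own experiment name, before you start the run: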
mlflow.set_experiment("experiment_name")
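
To make that concrete, here is a minimal sketch of how the experiment could be selected before `mlflow.start_run` is called. The helper names `experiment_path` and `start_tracked_run` are hypothetical, not part of the original flow. The sketch assumes the experiment name on Databricks should be an absolute workspace path such as `/Users/<user>/<name>`, and that credentials for the `databricks` tracking URI are already available inside the step's environment. It also passes the run name as a value; the original code passes the literal string `"current.run_id"`, so every run would get that same name.

```python
import os


def experiment_path(user: str, name: str) -> str:
    """Build an absolute workspace path to use as the experiment name.

    Assumption: the Databricks workspace keeps user experiments under /Users/<user>/.
    """
    return f"/Users/{user}/{name}"


def start_tracked_run(experiment: str, run_name: str):
    """Select (or create) the experiment, then start a run under it."""
    import mlflow  # imported lazily, as in the flow's start step

    mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "databricks"))
    # set_experiment creates the experiment if it does not exist yet,
    # which avoids the RESOURCE_DOES_NOT_EXIST error from start_run.
    mlflow.set_experiment(experiment)
    return mlflow.start_run(run_name=run_name)
```

Inside the `start` step this could be called as `run = start_tracked_run(experiment_path("<user>", "<name>"), current.run_id)`, with the placeholders replaced by your own workspace user and experiment name. Since the `except` block in the flow reports every failure as a connection problem, it may also be worth including the original exception text in the message so that errors like this one are easier to spot.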