I want to modify azureml.data.dataset_factory.register_pandas_dataframe() for my use case so that, in addition to the registered_dataset it returns by default, it also returns relative_path_with_guid.
The default definition of azureml.data.dataset_factory.register_pandas_dataframe() is:
@staticmethod
@track(_get_logger, custom_dimensions={'app_name': 'TabularDataset'}, activity_type=_PUBLIC_API)
def register_pandas_dataframe(dataframe, target, name, description=None, tags=None, show_progress=True):
    """Create a dataset from pandas dataframe.

    :param dataframe: Required, in memory dataframe to be uploaded.
    :type dataframe: pandas.DataFrame
    :param target: Required, the datastore path where the dataframe parquet data will be uploaded to.
        A guid folder will be generated under the target path to avoid conflict.
    :type target: typing.Union[azureml.data.datapath.DataPath, azureml.core.datastore.Datastore,
        tuple(azureml.core.datastore.Datastore, str)]
    :param name: Required, the name of the registered dataset.
    :type name: str
    :param description: Optional. A text description of the dataset. Defaults to None.
    :type description: str
    :param tags: Optional. Dictionary of key value tags to give the dataset. Defaults to None.
    :type tags: dict[str, str]
    :param show_progress: Optional, indicates whether to show progress of the upload in the console.
        Defaults to be True.
    :type show_progress: bool
    :return: The registered dataset.
    :rtype: azureml.data.TabularDataset
    """
    import pandas as pd
    from azureml.data.datapath import DataPath
    from uuid import uuid4

    console = get_progress_logger(show_progress)
    console("Validating arguments.")
    _check_type(dataframe, "dataframe", pd.core.frame.DataFrame)
    _check_type(name, "name", str)
    datastore, relative_path = parse_target(target, True)
    console("Arguments validated.")

    guid = uuid4()
    relative_path_with_guid = "%s/%s/" % (relative_path, guid)
    console("Successfully obtained datastore reference and path.")

    console("Uploading file to {}".format(relative_path_with_guid))
    sanitized_df = _sanitize_pandas(dataframe)
    dflow = dataprep().read_pandas_dataframe(df=sanitized_df, in_memory=True)
    target_directory_path = DataReference(datastore=datastore).path(relative_path_with_guid)
    dflow.write_to_parquet(directory_path=target_directory_path).run_local()
    console("Successfully uploaded file to datastore.")

    console("Creating and registering a new dataset.")
    datapath = DataPath(datastore, relative_path_with_guid)
    saved_dataset = TabularDatasetFactory.from_parquet_files(datapath)
    registered_dataset = saved_dataset.register(datastore.workspace, name,
                                                description=description,
                                                tags=tags,
                                                create_new_version=True)
    console("Successfully created and registered a new dataset.")
    return registered_dataset
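I understand that editing the installed source code directly is not good practice, and that I should instead change the package in development mode. Even though that option exists, I don't know where to find the setup.py for the azureml-sdk package. I get an error when I run: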
pip install azureml-sdk -e /path/to/azureml-dev/folder
ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /path/to/azureml-dev/folder
Has anyone tried something similar when tweaking azureml-sdk? How did you solve the setup.py problem?
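Some context on the error itself: pip's -e option expects the path to a local project directory that contains a setup.py or pyproject.toml, and a package installed from a PyPI wheel does not ship one, so there is nothing for pip to install in editable mode. The sketch below only shows where an installed module's source file lives on disk. module_source_path is a hypothetical helper name, and json.decoder is used as a stand-in because it is always available. As far as I know, azureml-sdk is a meta-package and azureml.data.dataset_factory is installed by azureml-core, but treat that as an assumption.

```python
import importlib.util


def module_source_path(module_name):
    """Return the file an importable module is loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is not installed.
        return None
    return spec.origin if spec is not None else None


# Stand-in module; with the SDK installed you would pass
# "azureml.data.dataset_factory" instead.
print(module_source_path("json.decoder"))
```

The printed path points into site-packages (or the standard library for the stand-in), which is the installed copy and not an editable source tree.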
Because azureml sdk-v2 is a closed-source Python module, its code cannot be modified.
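One possible workaround that leaves the package untouched: the quoted source runs `from uuid import uuid4` inside the function body, so the name is looked up on the uuid module at call time. Temporarily wrapping uuid.uuid4 with a recorder should therefore let you capture the guid and rebuild relative_path_with_guid yourself. This is only a sketch under that assumption. record_uuid4 and fake_register are hypothetical names, and fake_register is a stand-in for the real function so the example runs without Azure.

```python
import uuid
from contextlib import contextmanager
from unittest import mock


@contextmanager
def record_uuid4():
    """Record every value produced by uuid.uuid4() inside the with-block."""
    generated = []
    real_uuid4 = uuid.uuid4  # keep the real function before it is patched

    def recording_uuid4():
        value = real_uuid4()
        generated.append(value)
        return value

    # The patch is undone automatically when the with-block exits.
    with mock.patch.object(uuid, "uuid4", recording_uuid4):
        yield generated


def fake_register(relative_path):
    """Hypothetical stand-in that mimics the import pattern of the quoted source."""
    from uuid import uuid4  # resolved at call time, so the patched version is used
    return "%s/%s/" % (relative_path, uuid4())


with record_uuid4() as seen:
    path = fake_register("my/folder")

# The recorded guid is enough to rebuild the path outside the function.
rebuilt = "%s/%s/" % ("my/folder", seen[0])
print(rebuilt == path)
```

With the real SDK you would call register_pandas_dataframe inside the with-block instead of fake_register. Other code in the call chain may also call uuid.uuid4, so more than one value can be recorded; comparing the recorded values with the "Uploading file to ..." console message shows which one was used for the folder.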