Running an MPI Python script in an Azure ML pipeline

Problem description:

Following the example linked below - https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb - I am trying to run a distributed Python job through an Azure ML pipeline using the MpiStep pipeline class.

I tried to implement the same thing, but even when I change the node count parameter of the MpiStep class, the script always reports a size (i.e. comm.Get_size()) of 1 when it runs. Could you help me figure out what I am missing here? Is any specific setup required on the cluster?

Code snippets:

Pipeline code snippet:

model_dir = model_ds.path('./'+saved_model_blob+'/',data_reference_name='saved_model_path').as_mount()
label_dir = model_ds.path('./'+model_label_blob+'/',data_reference_name='model_label_blob').as_mount()

input_images = result_ds.path('./'+score_blob_name+'/',data_reference_name='Input_images').as_mount()

output_container = 'abc'
inti_container = 'xyz'

distributed_batch_score_step = MpiStep(
    name="batch_scoring",
    source_directory=SCRIPT_FOLDER,
    script_name="batch_scoring_script_mpi.py",
    arguments=["--dataset_path", input_images, 
               "--model_name", model_dir,
               "--label_dir", label_dir, 
               "--intermediate_data_container", inti_container, 
               "--output_container", output_container],
    compute_target=gpu_cluster,
    inputs=[input_images, model_dir,label_dir],
    pip_packages=["tensorflow","tensorflow-gpu==1.13.1","pillow","azure-keyvault","azure-storage-blob"],
    conda_packages=["mesa-libgl-cos6-x86_64","mpi4py==3.0.2","opencv=3.4.2","scikit-learn=0.21.2"],                                 
    use_gpu=True,
    allow_reuse=False,
    node_count=nodecount_param,
    process_count_per_node=1
)
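
The snippet above references nodecount_param without showing where it comes from. Below is a minimal sketch of how it might be declared as a PipelineParameter and overridden at submission time; the workspace handle ws and the experiment name are assumptions, not taken from the original post.

# Hypothetical sketch: declare the node count as a pipeline parameter so it can be
# changed per submission. ws and the experiment name are placeholders.
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline, PipelineParameter

nodecount_param = PipelineParameter(name="nodecount", default_value=2)

pipeline = Pipeline(workspace=ws, steps=[distributed_batch_score_step])
run = Experiment(ws, "batch-scoring-mpi").submit(
    pipeline, pipeline_parameters={"nodecount": 3}
)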

Python script code snippet:

# (imports such as mpi4py, pandas and tensorflow, argument parsing, and the helper
#  functions referenced below are omitted in the original snippet)

def run(input_dataset, comm):
    rank = comm.Get_rank()
    size = comm.Get_size()
    print("Rank:", rank)
    print("Size:", size)  # always shows 1, even when the node count is > 1
    print(MPI.Get_processor_name())

    file_names = sorted(get_file_names(args.dataset_path))

    # Split the file list evenly across the MPI ranks.
    partition_size = len(file_names) // size
    print("partition_size-->", partition_size)
    partitioned_filenames = file_names[rank * partition_size: (rank + 1) * partition_size]
    print("RANK {} - is processing {} images out of the total {}".format(
        rank, len(partitioned_filenames), len(file_names)))

    # call to Function 01

    # call to Function 02

    img_names = score_df['image_name'].unique()
    output_batch = pd.DataFrame()
    for i in img_names:
        # call to Function 3
        output_batch = output_batch.append(pp_output, ignore_index=True)
        output_paths_list = comm.gather(output_batch, root=0)

    print("RANK {} - number of pre-aggregated output files {}".format(rank, len(output_batch)))

    print("saved in", currentDT + '\\' + 'data.csv')

    if rank == 0:
        print("RANK {} - number of aggregated output files {}".format(rank, len(output_paths_list)))
        print("RANK {} - end".format(rank))


if __name__ == "__main__":
    with tf.device('/GPU:0'):
        init()
        comm = MPI.COMM_WORLD
        run(args.dataset_path, comm)
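
Before debugging the full scoring script, it can help to rule out the cluster/MPI setup with a tiny standalone check. The script below is a minimal sketch (not part of the original post) that only prints the rank, world size and host name reported by mpi4py; if it also reports a size of 1 across multiple nodes, the problem lies in how mpi4py was installed or how the step launches the processes rather than in the scoring code.

# mpi_sanity_check.py - hypothetical helper script, not from the original post.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank {} of {} on {}".format(comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))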


python-3.x azure-pipelines mpi4py azure-machine-learning-service
1 Answer

It turned out the problem was caused by how the package was installed: mpi4py was previously installed through conda with conda_packages=["mpi4py==3.0.2"]. It only started working after switching the installation to pip with pip_packages=["mpi4py"].
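
Applied to the step from the question, the fix amounts to moving mpi4py out of conda_packages and into pip_packages. A shortened sketch is shown below; the arguments and inputs lists are omitted for brevity and are unchanged from the question.

distributed_batch_score_step = MpiStep(
    name="batch_scoring",
    source_directory=SCRIPT_FOLDER,
    script_name="batch_scoring_script_mpi.py",
    compute_target=gpu_cluster,
    # mpi4py installed via pip instead of conda
    pip_packages=["tensorflow", "tensorflow-gpu==1.13.1", "pillow",
                  "azure-keyvault", "azure-storage-blob", "mpi4py"],
    conda_packages=["mesa-libgl-cos6-x86_64", "opencv=3.4.2", "scikit-learn=0.21.2"],
    use_gpu=True,
    allow_reuse=False,
    node_count=nodecount_param,
    process_count_per_node=1
)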
