通过参考下面的示例链接-https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb,我试图使用MPIStep管道类通过azure ML管道运行分布式python作业。
我尝试实现相同的功能,但是即使我更改了MpiStep类中的节点计数参数,在运行脚本时,它也会始终将大小(即comm.Get_size())显示为1。您能帮我解决我在这里所缺少的吗?集群上是否需要任何特定的设置?
代码段:
管道代码段:
model_dir = model_ds.path('./'+saved_model_blob+'/',data_reference_name='saved_model_path').as_mount() label_dir = model_ds.path('./'+model_label_blob+'/',data_reference_name='model_label_blob').as_mount() input_images = result_ds.path('./'+score_blob_name+'/',data_reference_name='Input_images').as_mount() output_container = 'abc' inti_container = 'xyz' distributed_batch_score_step = MpiStep( name="batch_scoring", source_directory=SCRIPT_FOLDER, script_name="batch_scoring_script_mpi.py", arguments=["--dataset_path", input_images, "--model_name", model_dir, "--label_dir", label_dir, "--intermediate_data_container", inti_container, "--output_container", output_container], compute_target=gpu_cluster, inputs=[input_images, model_dir,label_dir], pip_packages=["tensorflow","tensorflow-gpu==1.13.1","pillow","azure-keyvault","azure-storage-blob"], conda_packages=["mesa-libgl-cos6-x86_64","mpi4py==3.0.2","opencv=3.4.2","scikit-learn=0.21.2"], use_gpu=True, allow_reuse = False, node_count = nodecount_param, process_count_per_node = 1 )
Python脚本代码段:
def run(input_dataset,comm):
rank = comm.Get_rank()
size = comm.Get_size()
print("Rank:" , rank)
print("Size:", size) # shows always 1, even the input node count is >1
print(MPI.Get_processor_name())
file_names = get_file_names(args.dataset_path)
sorted(file_names)
partition_size = len(file_names) // size
print("partition_size-->",partition_size)
partitioned_filenames = file_names[rank * partition_size: (rank + 1) * partition_size]
print("RANK {} - is processing {} images out of the total {}".format(rank, len(partitioned_filenames),
len(file_names)))
# call to Function 01
# call to Function 02
img_names = score_df['image_name'].unique()
output_batch = pd.DataFrame()
for i in img_names:
# call to Function 3
output_batch = output_batch.append(pp_output, ignore_index=True)
output_paths_list = comm.gather(output_batch, root=0)
print("RANK {} - number of pre-aggregated output files {}".format(rank, len(output_batch)))
print("saved in", currentDT + '\\' + 'data.csv')
if rank == 0:
print("RANK {} - number of aggregated output files {}".format(rank, len(output_paths_list)))
print("RANK {} - end".format(rank))
if __name__ == "__main__":
with tf.device('/GPU:0'):
init()
comm = MPI.COMM_WORLD
run(args.dataset_path,comm)
我正在尝试通过使用以下示例链接通过MPIStep管道类通过azure ML管道运行分布式python作业-https://github.com/Azure/MachineLearningNotebooks/blob/master / ...
要知道问题是由于软件包版本引起的,较早的版本是通过conda_packages = [“ mpi4py == 3.0.2”通过conda安装的,通过pip更改安装后才起作用-pip_packages = [“ mpi4py”]] >