Problem creating an MPI struct: error 11 when calling MPI_Bcast

Question (votes: 1, answers: 1)

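I want to transfer a struct between processes, and to do that I am trying to create an MPI struct datatype. The code is for an Ant Colony Optimization (ACO) algorithm.

The header file with the C struct contains: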

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <math.h>
    #include <stddef.h>         // offsetof, used when building the MPI struct type
    #include <mpi.h>

    /* Constants */
    #define NUM_CITIES 100      // Number of cities
    //among others

    typedef struct {
        int city, next_city, tabu[NUM_CITIES], path[NUM_CITIES], path_index;
        double tour_distance;
    } ACO_Ant;

I tried to structure my code as suggested in this thread.

Program code:

    int main(int argc, char *argv[])
    {
    MPI_Datatype MPI_TABU, MPI_PATH, MPI_ANT;

    // Initialize MPI
    MPI_Init(&argc, &argv);
    //Determines the size (&procs) of the group associated with a communicator (MPI_COMM_WORLD)
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    //Determines the rank (&rank) of the calling process in the communicator (MPI_COMM_WORLD)
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(NUM_CITIES, MPI_INT, &MPI_TABU);
    MPI_Type_contiguous(NUM_CITIES, MPI_INT, &MPI_PATH);
    MPI_Type_commit(&MPI_TABU);
    MPI_Type_commit(&MPI_PATH);

    // Create ant struct
    //int city, next_city, tabu[NUM_CITIES], path[NUM_CITIES], path_index;
    //double tour_distance;
    int blocklengths[6] = {1,1, NUM_CITIES, NUM_CITIES, 1, 1};
    MPI_Datatype    types[6] = {MPI_INT, MPI_INT, MPI_TABU, MPI_PATH, MPI_INT, MPI_DOUBLE};
    MPI_Aint        offsets[6] = { offsetof( ACO_Ant, city ), offsetof( ACO_Ant, next_city), offsetof( ACO_Ant, tabu), offsetof( ACO_Ant, path ), offsetof( ACO_Ant, path_index ), offsetof( ACO_Ant, tour_distance )};

    MPI_Datatype tmp_type;
    MPI_Aint lb, extent;

    MPI_Type_create_struct(6, blocklengths, offsets, types, &tmp_type);
    MPI_Type_get_extent( tmp_type, &lb, &extent );
    //Tried all of these
    MPI_Type_create_resized( tmp_type, lb, extent, &MPI_ANT );
    //MPI_Type_create_resized( tmp_type, 0, sizeof(MPI_ANT), &MPI_ANT );
    //MPI_Type_create_resized( tmp_type, 0, sizeof(ant), &MPI_ANT );
    MPI_Type_commit(&MPI_ANT);

    printf("Return: %d\n" , MPI_Bcast(ant, NUM_ANTS, MPI_ANT, 0, MPI_COMM_WORLD));
    }

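But as soon as the program reaches the MPI_Bcast call, it crashes with error 11. I first assumed this was MPI_ERR_TOPOLOGY, as per this manual, but it is actually a segmentation fault (signal 11).

I am also unsure about some of the original program author's code. Can anyone explain why they declare

    MPI_Aint displacements[3];
    MPI_Datatype typelist[3];

with size 3, when the struct has 2 members and the block lengths array has size 2?

    int block_lengths[2];

Code: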

    void ACO_Build_best(ACO_Best_tour *tour, MPI_Datatype *mpi_type /*out*/)
    {
        int block_lengths[2];
        MPI_Aint displacements[3];
        MPI_Datatype typelist[3];
        MPI_Aint start_address;
        MPI_Aint address;

        block_lengths[0] = 1;
        block_lengths[1] = NUM_CITIES;

        typelist[0] = MPI_DOUBLE;
        typelist[1] = MPI_INT;

        displacements[0] = 0;

        MPI_Address(&(tour->distance), &start_address);
        MPI_Address(tour->path, &address);
        displacements[1] = address - start_address;

        MPI_Type_struct(2, block_lengths, displacements, typelist, mpi_type);
        MPI_Type_commit(mpi_type);
    }

Any and all help would be greatly appreciated. Edit: I am looking for help solving the problem, not just StackOverflow jargon.

Tags: c struct mpi mpich

1 Answer (votes: 1)

This part is wrong:

    int blocklengths[6] = {1,1, NUM_CITIES, NUM_CITIES, 1, 1};
    MPI_Datatype    types[6] = {MPI_INT, MPI_INT, MPI_TABU, MPI_PATH, MPI_INT, MPI_DOUBLE};
    MPI_Aint        offsets[6] = { offsetof( ACO_Ant, city ), offsetof( ACO_Ant, next_city), offsetof( ACO_Ant, tabu), offsetof( ACO_Ant, path ), offsetof( ACO_Ant, path_index ), offsetof( ACO_Ant, tour_distance )};

Both the MPI_TABU and MPI_PATH datatypes already cover NUM_CITIES elements. When you also specify NUM_CITIES as the corresponding block length, the resulting datatype tries to access NUM_CITIES * NUM_CITIES elements, which likely causes the segfault (signal 11).

Either set all elements of blocklengths to 1, or replace MPI_TABU and MPI_PATH in the types array with MPI_INT.
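The size arithmetic behind this can be checked without calling MPI at all. The sketch below is only an illustration: `block_bytes` and `tabu_block_end` are hypothetical helpers (they are not MPI functions), and they model a struct block as `blocklength` copies of a contiguous type holding `elems_per_type` ints.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CITIES 100 /* value from the question's header */

typedef struct {
    int city, next_city, tabu[NUM_CITIES], path[NUM_CITIES], path_index;
    double tour_distance;
} ACO_Ant;

/* Bytes covered by one struct block: `blocklength` copies of a
   contiguous type that itself holds `elems_per_type` ints. */
size_t block_bytes(size_t blocklength, size_t elems_per_type)
{
    return blocklength * elems_per_type * sizeof(int);
}

/* Offset, from the start of one ACO_Ant, of the first byte past the tabu block. */
size_t tabu_block_end(size_t blocklength, size_t elems_per_type)
{
    return offsetof(ACO_Ant, tabu) + block_bytes(blocklength, elems_per_type);
}

/* Both suggested fixes make the tabu block end exactly where path begins. */
void check_fixed_blocks(void)
{
    assert(tabu_block_end(1, NUM_CITIES) == offsetof(ACO_Ant, path)); /* 1 x MPI_TABU */
    assert(tabu_block_end(NUM_CITIES, 1) == offsetof(ACO_Ant, path)); /* NUM_CITIES x MPI_INT */
}
```

With the question's combination, `tabu_block_end(NUM_CITIES, NUM_CITIES)` is far larger than `sizeof(ACO_Ant)`, so the datatype describes memory well beyond a single ant.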

This part is also wrong:

    MPI_Type_create_struct(6, blocklengths, offsets, types, &tmp_type);
    MPI_Type_get_extent( tmp_type, &lb, &extent );
    //Tried all of these
    MPI_Type_create_resized( tmp_type, lb, extent, &MPI_ANT );
    //MPI_Type_create_resized( tmp_type, 0, sizeof(MPI_ANT), &MPI_ANT );
    //MPI_Type_create_resized( tmp_type, 0, sizeof(ant), &MPI_ANT );
    MPI_Type_commit(&MPI_ANT);

Calling MPI_Type_create_resized with the values returned by MPI_Type_get_extent is pointless, because it simply duplicates the type without actually resizing it. Using sizeof(MPI_ANT) is wrong, because MPI_ANT is not a C type but an MPI handle, which is either an integer index or a pointer (implementation-dependent). sizeof(ant) would work if ant were of type ACO_Ant, but since you call MPI_Bcast(ant, NUM_ANTS, ...), ant is either a pointer, in which case sizeof(ant) is just the pointer size, or an array, in which case sizeof(ant) is NUM_ANTS times larger than it should be. The correct call is:

    MPI_Type_create_resized(tmp_type, 0, sizeof(ACO_Ant), &ant_type);
    MPI_Type_commit(&ant_type);
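The three candidate extents can be compared in plain C. This is a minimal sketch, not code from the question: `NUM_ANTS` is given a made-up value because the question does not show it, and the variable and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CITIES 100 /* value from the question's header */
#define NUM_ANTS 4     /* hypothetical: the question does not show this value */

typedef struct {
    int city, next_city, tabu[NUM_CITIES], path[NUM_CITIES], path_index;
    double tour_distance;
} ACO_Ant;

ACO_Ant *ant_as_pointer;         /* case 1: ant declared as a pointer */
ACO_Ant  ant_as_array[NUM_ANTS]; /* case 2: ant declared as an array  */

/* sizeof applied to a pointer variable yields the pointer size only. */
size_t extent_if_pointer(void) { return sizeof(ant_as_pointer); }

/* sizeof applied to an array variable yields the size of the whole array. */
size_t extent_if_array(void) { return sizeof(ant_as_array); }

/* The extent of one element is the size of the struct type itself. */
size_t extent_from_type(void) { return sizeof(ACO_Ant); }

/* The array case overshoots the per-element extent by a factor of NUM_ANTS. */
void check_extent_candidates(void)
{
    assert(extent_if_pointer() == sizeof(ACO_Ant *));
    assert(extent_if_array() == NUM_ANTS * extent_from_type());
}
```

Only `extent_from_type()` matches the stride between consecutive ants in memory, which is what the resized datatype needs when it is sent with a count greater than 1.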

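And please, do not use MPI_ as a prefix in your own variable or function names. It makes the code hard to read and is very misleading ("is that a predefined MPI datatype or a user-defined one?").

As for the last question, the author may have had a different structure in mind. Nothing stops you from using larger arrays, as long as you call MPI_Type_create_struct with the correct number of significant elements.

Note: you do not have to commit MPI datatypes that are never used directly in communication calls. That is, these two lines are unnecessary:

    MPI_Type_commit(&MPI_TABU);
    MPI_Type_commit(&MPI_PATH);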