I would be grateful if you could help me with the following situation.
I broadcast a char array (on localhost) in two consecutive steps:
1. MPI_Bcast the size of the array
2. MPI_Bcast the array itself
This is done with dynamic process spawning. The data communication works fine until the size of the array exceeds (roughly) 8,375,000 elements. That is about 8.375 MB of data, which seems rather small according to the available documentation; from what I have read elsewhere, MPI supports up to 2^31 elements. Beyond 8,375,000 elements I get an MPI error with EXIT CODE: 139.
I also tested the code under valgrind. The summary does not indicate anything alarming, but I do get various MPI-related errors starting with Syscall param writev(vector[...]) points to uninitialised byte(s). Here is the tail of the log:
...
==15125== Syscall param writev(vector[...]) points to uninitialised byte(s)
==15125== at 0x5B83327: writev (writev.c:26)
==15125== by 0x8978FF1: MPL_large_writev (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x8961D4B: MPID_nem_tcp_iStartContigMsg (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x8939E15: MPIDI_CH3_RndvSend (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x895DD69: MPID_nem_lmt_RndvSend (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x8945FE9: MPID_Send (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x88B1D84: MPIC_Send (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x886EC08: MPIR_Bcast_inter_remote_send_local_bcast (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x87C28F2: MPIR_Bcast_impl (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x87C3183: PMPI_Bcast (in /usr/lib/libmpi.so.12.1.6)
==15125== by 0x50B9CF5: QuanticBoost::Calculators::Exposures::Mpi::dynamic_mpi_master(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, QuanticBoost::WorkflowContext&) (in /home/ubuntu/Documents/code/quanticboostnew-build/release/lib/libCppLib.so)
==15125== by 0x5140CBA: QuanticBoost::MpiExposureSpawnTask::execute(QuanticBoost::WorkflowContext&) (in /home/ubuntu/Documents/code/quanticboostnew-build/release/lib/libCppLib.so)
==15125== Address 0x1ffefff524 is on thread 1's stack
==15125== in frame #3, created by MPIDI_CH3_RndvSend (???:)
==15125== Uninitialised value was created by a stack allocation
==15125== at 0x8939D70: MPIDI_CH3_RndvSend (in /usr/lib/libmpi.so.12.1.6)
==15125==
==15125==
==15125== HEAP SUMMARY:
==15125== in use at exit: 184 bytes in 6 blocks
==15125== total heap usage: 364,503 allocs, 364,497 frees, 204,665,377 bytes allocated
==15125==
==15125== LEAK SUMMARY:
==15125== definitely lost: 0 bytes in 0 blocks
==15125== indirectly lost: 0 bytes in 0 blocks
==15125== possibly lost: 0 bytes in 0 blocks
==15125== still reachable: 184 bytes in 6 blocks
==15125== suppressed: 0 bytes in 0 blocks
==15125== Reachable blocks (those to which a pointer was found) are not shown.
==15125== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==15125==
==15125== For counts of detected and suppressed errors, rerun with: -v
==15125== ERROR SUMMARY: 15 errors from 10 contexts (suppressed: 0 from 0)
Can you help me identify the valgrind errors and resolve the MPI failure with exit code 139? Below I share minimal snippets of the master and worker code, together with the output of the failing run.
Code snippet (master):
std::cout << "Spawning "<< dynamic_procs << " " << worker_path.string() <<std::endl;
MPI_Comm_spawn(
worker_path.string().c_str(),
MPI_ARGV_NULL,
dynamic_procs,
info,
0,
MPI_COMM_SELF, //intra-communication
&intercomm, //inter-communication
MPI_ERRCODES_IGNORE);
std::cout << "\n________________ MASTER: MPI spawning starts _________________ \n" << std::endl;
// I normally send the size of the char array in the 1st Bcast
// and the array itself in a 2nd Bcast
//
// but MPI starts failing somewhere beyond 8.375e6 elements
// though I expect that happening after 2^31 array elements, or not???
//I test the limits of the array size by overriding manually
int in_str_len=8.375e6; //Until this size it all works
//int in_str_len=8.376e6; //This does NOT work
//int in_str_len=8.3765e6; //This does NOT work and so on
MPI_Bcast(
&in_str_len, //void* data,
1, //int count,
MPI_INT, //MPI_Datatype datatype,
MPI_ROOT, //int use MPI_ROOT not own set root!
intercomm //MPI_Comm communicator
);
//Initialize a test buffer
std::string s (in_str_len, 'x'); //It works
//char d[in_str_len+1]; //It works
/*
* The 2nd MPI_Bcast will send the data to all nodes
*/
MPI_Bcast(
s.data(), //void* data,
in_str_len, //int count,
MPI_BYTE, //MPI_Datatype datatype, MPI_BYTE,MPI_CHAR work
MPI_ROOT, //int use MPI_ROOT not own set root!
intercomm //MPI_Comm communicator
);
Code snippet (worker):
std::cout << "I am in a spawned process " << rank << "/" << dynamic_procs
<< " from host " << name << std::endl;
int in_str_len;
//Receive stream size;
MPI_Bcast(
&in_str_len, //void* data,
1, //int count,
MPI_INT, //MPI_Datatype datatype,
0, //int root,
parent //MPI_Comm communication with parent (not MPI_COMM_WORLD)
);
std::cout << "1st MPI_Bcast received len: "<< in_str_len * 1e-6<<"Mb" << std::endl;
MPI_Barrier(MPI_COMM_WORLD); //Tested with and without the barrier
char data[in_str_len+1];
std::cout << "Create char array for 2nd MPI_Bcast with length: "<< in_str_len << std::endl;
MPI_Bcast(
data, //void* data,
in_str_len, //int count,
MPI_BYTE, //MPI_Datatype datatype,
0, //int root,
parent //MPI_Comm communication with parent (not MPI_COMM_WORLD)
);
std::cout << "2nd MPI_Bcast received data: " << sizeof(data) << std::endl;
Output:
Spawning 3 /home/ubuntu/Documents/code/build/release/bin/mpi_worker
________________ MASTER: MPI spawning starts _________________
I am in a spawned process 1/3 from host ip-172-31-30-254
I am in a spawned process 0/3 from host ip-172-31-30-254
I am in a spawned process 2/3 from host ip-172-31-30-254
1st MPI_Bcast received len: 8.3765Mb
1st MPI_Bcast received len: 8.3765Mb
1st MPI_Bcast received len: 8.3765Mb
Create char array for 2nd MPI_Bcast with length: 8376500
Create char array for 2nd MPI_Bcast with length: 8376500
Create char array for 2nd MPI_Bcast with length: 8376500
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 9690 RUNNING AT localhost
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
PS: Please let me know if you need any additional information or further edits to my post.
First, valgrind says something about uninitialised data. That is real: you are sending an array that has not been filled in. I will spare you the technical details, but you can google "page instantiation". Basically, until you initialise it, the memory does not really exist yet.
Next, char data[somethingbig]
creates the array on the stack, whose size is limited (commonly 8 MiB on Linux, which matches the threshold you are hitting); that is where your segfault, exit code 139, comes from. Such variable-length arrays are also not standard (actually, they appeared in C99 and were then made optional again in C11; basically: don't do that). Use malloc
instead.
Oh, you tagged this "c++". In that case, just use std::vector
for large arrays. There is no reason to use anything else and, as you can see, plenty of reasons not to.