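I have an MPI program that aborts with a "malloc(): unaligned tcache chunk detected" error when I run it on one processor, but not when I run it on 8 processors. The memory allocation looks like this: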
ALLOCATE(XPOINTS((Npx+1)))
IF(MY_RANK .eq. 0) WRITE(*,*) "TESTING"
ALLOCATE(YPOINTS((Npy+1)))
ALLOCATE(ZPOINTS((Npz+1)))
ALLOCATE(x_GLBL((1-Ngl):(Nx_glbl+Ngl)))
ALLOCATE(y_GLBL((1-Ngl):(Ny_glbl+Ngl)))
ALLOCATE(z_GLBL((1-Ngl):(Nz_glbl+Ngl)))
Note that I have verified that all of the sizes passed to ALLOCATE are integers. This is the error I see:
TESTING
malloc(): unaligned tcache chunk detected
malloc(): unaligned tcache chunk detected
Program received signal SIGABRT: Process abort signal.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Backtrace for this error:
#0 0x7f2145348960 in ???
#1 0x7f2145347ac5 in ???
#2 0x7f214513e51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x7f21451929fc in __pthread_kill_implementation
at ./nptl/pthread_kill.c:44
#4 0x7f21451929fc in __pthread_kill_internal
at ./nptl/pthread_kill.c:78
#5 0x7f21451929fc in __GI___pthread_kill
at ./nptl/pthread_kill.c:89
#6 0x7f214513e475 in __GI_raise
at ../sysdeps/posix/raise.c:26
#7 0x7f21451247f2 in __GI_abort
at ./stdlib/abort.c:79
#8 0x7f2145185675 in __libc_message
at ../sysdeps/posix/libc_fatal.c:155
#9 0x7f214519ccfb in malloc_printerr
at ./malloc/malloc.c:5664
#10 0x7f21451a13db in tcache_get
at ./malloc/malloc.c:3195
#11 0x7f21451a13db in __GI___libc_malloc
at ./malloc/malloc.c:3313
#12 0x55ecaeda5ab3 in ???
#13 0x55ecaed90452 in ???
#14 0x55ecaed902ee in ???
#15 0x7f2145125d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#16 0x7f2145125e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#17 0x55ecaed90324 in ???
#18 0xffffffffffffffff in ???
#0 0x7efe26f48960 in ???
#1 0x7efe26f47ac5 in ???
#2 0x7efe26d3e51f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x7efe26d929fc in __pthread_kill_implementation
at ./nptl/pthread_kill.c:44
#4 0x7efe26d929fc in __pthread_kill_internal
at ./nptl/pthread_kill.c:78
#5 0x7efe26d929fc in __GI___pthread_kill
at ./nptl/pthread_kill.c:89
#6 0x7efe26d3e475 in __GI_raise
at ../sysdeps/posix/raise.c:26
#7 0x7efe26d247f2 in __GI_abort
at ./stdlib/abort.c:79
#8 0x7efe26d85675 in __libc_message
at ../sysdeps/posix/libc_fatal.c:155
#9 0x7efe26d9ccfb in malloc_printerr
at ./malloc/malloc.c:5664
#10 0x7efe26da13db in tcache_get
at ./malloc/malloc.c:3195
#11 0x7efe26da13db in __GI___libc_malloc
at ./malloc/malloc.c:3313
#12 0x55fa223ddab3 in ???
#13 0x55fa223c8452 in ???
#14 0x55fa223c82ee in ???
#15 0x7efe26d25d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#16 0x7efe26d25e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#17 0x55fa223c8324 in ???
#18 0xffffffffffffffff in ???
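Has anyone run into this before? I have tried everything I can think of, but I don't understand why it fails on fewer than 8 processors. I tried both the Intel and GNU Fortran compilers, with the same result: it works with 8 processors but not with 1. Is this a problem specific to my laptop?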
Edit: I could not reproduce this error in a simpler program, so I am attaching the GitHub repository: https://github.com/SahajSJain/MyPoisonX.git
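The message
malloc(): unaligned tcache chunk detected
is an error from malloc, the routine that ALLOCATE is implemented on top of. In your case, the malloc implementation appears to store additional metadata about each allocated block right next to the heap allocation itself. During a later allocation, malloc detected that this metadata had been corrupted, which is usually caused by an out-of-bounds write to another allocation.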
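As a minimal sketch of how such an out-of-bounds write typically arises, the program below runs a loop one step past the end of an allocatable array and guards the store with LBOUND/UBOUND. The array name dxinv and the size 26 are used for illustration only; this is an assumed pattern, not the actual code from the repository.

```fortran
program bounds_demo
  implicit none
  real, allocatable :: dxinv(:)
  integer :: i, n, skipped

  n = 26                      ! hypothetical size, for illustration only
  allocate(dxinv(n))
  dxinv = 0.0
  skipped = 0

  ! The loop deliberately runs one step too far (1..n+1), the classic
  ! off-by-one that would write past the end of the array.
  do i = 1, n + 1
     if (i >= lbound(dxinv, 1) .and. i <= ubound(dxinv, 1)) then
        dxinv(i) = 1.0 / real(i)
     else
        ! Without this guard the store would land outside the array,
        ! possibly in memory malloc uses for its own bookkeeping.
        skipped = skipped + 1
        write(*,'(a,i0,a,i0)') 'skipped out-of-bounds index ', i, &
             ', upper bound is ', ubound(dxinv, 1)
     end if
  end do

  if (skipped /= 1) error stop 'unexpected number of out-of-bounds indices'
  deallocate(dxinv)
end program bounds_demo
```

In real code, hand-written guards like this are rarely practical. Compiling with gfortran's -fcheck=bounds option makes the runtime perform the equivalent check on every array access and stop at the first violation.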
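AddressSanitizer and valgrind are tools that detect such out-of-bounds accesses at run time. I tried compiling your code with gfortran and OpenMPI. The compiler complained that the calls to MPI_Cart_create and MPI_Cart_coords do not match their declarations: PeriodicArr must be declared LOGICAL, and the call to MPI_Cart_coords is missing the ierror argument.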
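For reference, here is a minimal sketch of the two calls with matching argument types, using the Fortran `mpi` module. All variable names (periods, dims, coords, comm_cart and so on) are hypothetical stand-ins rather than the names used in the repository, and the 3-D decomposition is an assumption.

```fortran
program cart_demo
  use mpi
  implicit none
  integer, parameter :: ndims = 3      ! assumed 3-D process grid
  integer :: dims(ndims), coords(ndims)
  logical :: periods(ndims)            ! must be LOGICAL, not INTEGER
  logical :: reorder
  integer :: comm_cart, my_rank, nprocs, ierror

  call MPI_Init(ierror)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierror)

  dims = 0                             ! zero lets MPI pick each extent
  call MPI_Dims_create(nprocs, ndims, dims, ierror)
  periods = .false.
  reorder = .false.

  call MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, periods, reorder, &
                       comm_cart, ierror)
  call MPI_Comm_rank(comm_cart, my_rank, ierror)

  ! ierror is the final argument here as well; with this binding it is
  ! not optional.
  call MPI_Cart_coords(comm_cart, my_rank, ndims, coords, ierror)

  write(*,'(a,i0,a,3(1x,i0))') 'rank ', my_rank, ' coords:', coords

  call MPI_Comm_free(comm_cart, ierror)
  call MPI_Finalize(ierror)
end program cart_demo
```

With the `mpi` module or mpif.h, every routine takes ierror as its last argument. If it is omitted and the compiler does not reject the call, the library may still store the return code through a stray address, which is one plausible way to corrupt unrelated memory. Something like `mpif90 cart_demo.f90 -o cart_demo` followed by `mpirun -np 8 ./cart_demo` should build and run the sketch, assuming the usual MPI compiler wrapper is installed.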
To use AddressSanitizer, add -fsanitize=address to CFLAGS and LFLAGS.
Running the program with
mpirun -np 2 env ASAN_OPTIONS="detect_leaks=0" ./MyPoisonX
then reports:
At line 199 of file CODE.SETUP_FIELD_VARIABLES.F90
Fortran runtime error: Index '27' of dimension 1 of array 'dxinv' above upper bound of 26
Disabling leak detection is necessary to keep the screen from being flooded with a large number of MPI-related memory leak reports.
I could not reproduce the malloc error on my system, but the missing ierror argument may already explain the problem.