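I'm writing a program that uses Intel's MKL for some matrix multiplications. I have a frustrating requirement to use only a custom version of dynamic memory allocation. I know this is generally considered a bad idea, but I'm using the linker's --wrap feature to wrap malloc and free with my own custom implementations. Overall, this has worked well so far.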
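For readers unfamiliar with --wrap, here is a minimal sketch of the mechanism, not the code from the question. With -Wl,--wrap=malloc, GNU ld resolves undefined references to malloc as __wrap_malloc, and references to __real_malloc as the original malloc. The counter wrap_malloc_calls and the weak attribute are assumptions of this sketch; the weak declarations are there only so the file also links when the flag is absent.

```cpp
#include <cstddef>
#include <cstdlib>

// Resolved by the linker to the original malloc/free when building with
// -Wl,--wrap=malloc -Wl,--wrap=free. Declared weak (an assumption of this
// sketch) so that without those flags the symbols are simply null.
extern "C" void* __real_malloc(std::size_t) __attribute__((weak));
extern "C" void __real_free(void*) __attribute__((weak));

// Hypothetical counter, used only to show that the wrapper was entered.
static std::size_t wrap_malloc_calls = 0;

extern "C" void* __wrap_malloc(std::size_t numBytes)
{
    ++wrap_malloc_calls;
    // Forward to the real allocator if the linker provided one,
    // otherwise fall back to the C library directly.
    return __real_malloc ? __real_malloc(numBytes) : std::malloc(numBytes);
}

extern "C" void __wrap_free(void* ptr)
{
    if (__real_free)
        __real_free(ptr);
    else
        std::free(ptr);
}
```

The redirection happens at link time and applies only to undefined symbol references in the objects being linked, so code that reaches an allocator some other way would not pass through __wrap_malloc.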
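However, some MKL code appears to be doing dynamic allocation without calling my custom malloc. I know that MKL also replaces the system malloc with its own custom malloc, but my program calls mkl_disable_fast_mm(), which, as I understand it, should turn off MKL's custom malloc and revert to the system malloc. Since I have already wrapped the system malloc with my custom malloc via --wrap, I expect to see my custom malloc called whenever MKL allocates dynamically.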
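When I run the program normally (as described above), I can see my custom malloc being called everywhere malloc is used, except for calls made from inside MKL.
To add another layer of complexity, if I run the program under valgrind, I do see my custom malloc called everywhere, including from within MKL. I realize that valgrind also replaces malloc with its own version, so multiple levels of malloc replacement are in play in that case.
My question is: how can I get MKL to call my custom malloc when it allocates dynamically? It seems this must be possible, since it happens under valgrind, but I can't find a way to do it without valgrind.
I put together a very simple example to demonstrate what I tried to describe above: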
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

//With --wrap, __real_malloc and __real_free are functions (not function
//pointer variables) that the linker resolves to the original malloc and free
extern "C" void* __real_malloc(size_t);
extern "C" void __real_free(void*);

extern "C" void* __wrap_malloc(size_t numBytes)
{
    //Yes, this is a dumb way to do this (no alignment, no bounds check), keeping it minimal for demo
    static char heapSpace[10000000] = {0};
    static size_t heapOff = 0;
    //size_t values are printed with %zu
    fprintf(stderr, "In __wrap_malloc, cur offset: %zu requesting %zu!\n", heapOff, numBytes);
    void* heapLoc = heapSpace + heapOff;
    heapOff += numBytes;
    return heapLoc;
}

extern "C" void __wrap_free(void* ptrToFree)
{
    fprintf(stderr, "In __wrap_free!\n");
    //just a no-op for the minimal demo
}
int main()
{
    fprintf(stderr, "Disabling fast memory management for MKL in order to use system malloc and free instead\n");
    int disableFastMMReturnVal = mkl_disable_fast_mm();
    fprintf(stderr, " --> Reports value of %d (1 should mean MKL memory management turned off successfully)\n", disableFastMMReturnVal);

    //Use malloc to allocate a small array of chars
    char* tmpPtr;
    tmpPtr = (char*)malloc(4 * sizeof(char));
    tmpPtr[0] = 'f'; tmpPtr[1] = 'o'; tmpPtr[2] = 'o'; tmpPtr[3] = '\0';

    //Use malloc to allocate another small array of chars
    char* diffPtr;
    diffPtr = (char*)malloc(3 * sizeof(char));
    diffPtr[0] = 'h'; diffPtr[1] = 'i'; diffPtr[2] = '\0';

    //See that data is as expected
    fprintf(stderr, "TEMPPTR: %s DIFFPTR: %s\n", tmpPtr, diffPtr);

    //Just a no-op for this demo, but see that the wrapped free gets called
    free(diffPtr);
    free(tmpPtr);

    //Now, set up a MKL matrix multiply call:
    const int M = 128;
    const int K = 128;
    const int N = 128;
    const float alpha = 1.0;
    const float beta = 0.0;
    float A[M * K];
    float B[K * N];
    float C[M * N];

    //Initialize the input matrices to known values
    for (int r = 0; r < M; r++)
        for (int c = 0; c < K; c++)
            A[r * K + c] = r * c;
    for (int r = 0; r < K; r++)
        for (int c = 0; c < N; c++)
            B[r * N + c] = r + c;

    fprintf(stderr, "START CALL TO cblas_sgemm\n");
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, alpha, A, K, B, N, beta, C, N);
    fprintf(stderr, "FINISHED CALL TO cblas_sgemm\n");

    //Print some values from the output to check consistent results from run to run
    fprintf(stderr, "[0][0]: %f\n", C[0 * N + 0]);
    fprintf(stderr, "[20][20]: %f\n", C[20 * N + 20]);
    fprintf(stderr, "[40][40]: %f\n", C[40 * N + 40]);
    fprintf(stderr, "[100][100]: %f\n", C[100 * N + 100]);
    return 0;
}
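A side note on the demo allocator: it hands out pointers at arbitrary byte offsets and never checks for exhaustion, whereas callers of malloc may assume the result is aligned for any type. Whether that matters for MKL here is not established by this post. The following is a minimal sketch of a bump allocator that does both; every name in it (demo, kArenaSize, kAlign, arena, arenaOff, bump_alloc) is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is not the code from the question.
namespace demo {

constexpr std::size_t kArenaSize = 10000000;               // same size as heapSpace in the demo
constexpr std::size_t kAlign = alignof(std::max_align_t);  // alignment suitable for any scalar type

alignas(std::max_align_t) static unsigned char arena[kArenaSize];
static std::size_t arenaOff = 0;

// Returns a kAlign-aligned block, or nullptr when the arena is exhausted.
void* bump_alloc(std::size_t numBytes)
{
    // Round the current offset up to the next multiple of kAlign.
    std::size_t start = (arenaOff + kAlign - 1) / kAlign * kAlign;
    if (start > kArenaSize || numBytes > kArenaSize - start)
        return nullptr;  // report failure instead of overrunning the arena
    arenaOff = start + numBytes;
    return arena + start;
}

} // namespace demo
```

With an allocator like this, the offsets printed by the wrapper would differ from the ones in the output shown here, because each block would start on an aligned boundary.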
Here is my output when running without valgrind:
# ./demo.exe
Disabling fast memory management for MKL in order to use system malloc and free instead
--> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
Here is my output when running with valgrind:
# valgrind ./demo.exe
==487722== Memcheck, a memory error detector
==487722== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==487722== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==487722== Command: ./demo.exe
==487722==
Disabling fast memory management for MKL in order to use system malloc and free instead
--> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 7 requesting 4344376!
In __wrap_malloc, cur offset: 4344383 requesting 69664!
In __wrap_malloc, cur offset: 4414047 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
==487722==
==487722== HEAP SUMMARY:
==487722== in use at exit: 0 bytes in 0 blocks
==487722== total heap usage: 1 allocs, 1 frees, 72,704 bytes allocated
==487722==
==487722== All heap blocks were freed -- no leaks are possible
==487722==
==487722== For lists of detected and suppressed errors, rerun with: -s
==487722== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Note that without valgrind, __wrap_malloc is never called during the call to cblas_sgemm, but with valgrind it is called 3 times.
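Edit #2: As @Andrew Henle suggested, there may be other allocation functions involved, so I also added wrappers for calloc and realloc. With these two new wrappers added, the program produces exactly the results shown above. Next, I ran nm on the executable (everything is statically linked) and got the following: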
nm -C demo.exe |& grep alloc
00000000004021f0 T cblas_xerbla_malloc_error
0000000000c7b008 D i_calloc
0000000000c7b000 D i_malloc
0000000000c7b010 D i_realloc
0000000001f94d40 b mkl_hbw_malloc_psize
00000000004438f0 T mkl_serv_allocate
000000000044a550 T mkl_serv_calloc
0000000000446030 T mkl_serv_deallocate
000000000044a660 T mkl_serv_jit_alloc
0000000000444b10 T mkl_serv_malloc
0000000000449860 T mkl_serv_realloc
0000000000443710 t mm_internal_malloc
0000000000442df0 t mm_internal_realloc
0000000001f94d50 b sys_alloc
0000000001f94d68 b sys_allocate
0000000001f94d70 b sys_deallocate
0000000001f94d58 b sys_realloc
0000000000401442 T __wrap_calloc
00000000004013e6 T __wrap_malloc
00000000004014b2 T __wrap_realloc
0000000001f8dd80 b __wrap_calloc::heapOff
0000000001604700 b __wrap_calloc::heapSpace
00000000016046e0 b __wrap_malloc::heapOff
0000000000c7b060 b __wrap_malloc::heapSpace
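The nm listing shows i_malloc, i_calloc and i_realloc as data symbols rather than functions. As far as I know, the MKL developer guide describes these (together with i_free) as replaceable function pointers declared in i_malloc.h, to be assigned before the first MKL call; that should be verified against the documentation for the MKL version in use. The sketch below only models the pattern so that it compiles without MKL: lib_malloc_hook, lib_free_hook, library_internal_work, custom_malloc, custom_free and install_custom_hooks are all hypothetical stand-ins.

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-ins for a library's replaceable allocation hooks (hypothetical names).
// In MKL the analogous pointers are understood to be i_malloc / i_free.
using MallocHook = void* (*)(std::size_t);
using FreeHook = void (*)(void*);
MallocHook lib_malloc_hook = std::malloc;  // defaults to the system allocator
FreeHook lib_free_hook = std::free;

// Library-side code allocates only through the hooks, never by symbol name.
void library_internal_work()
{
    void* scratch = lib_malloc_hook(256);
    lib_free_hook(scratch);
}

// Application-side replacements that count calls and forward to libc.
static int customMallocCalls = 0;
static int customFreeCalls = 0;
void* custom_malloc(std::size_t n) { ++customMallocCalls; return std::malloc(n); }
void custom_free(void* p) { ++customFreeCalls; std::free(p); }

// Install the replacements; this must run before the library first allocates.
void install_custom_hooks()
{
    lib_malloc_hook = custom_malloc;
    lib_free_hook = custom_free;
}
```

If the MKL pointers behave as described, the corresponding step in the demo would be to assign the custom functions to i_malloc, i_calloc, i_realloc and i_free at the top of main, before cblas_sgemm is called.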
From David Agans's book on the 9 rules of debugging: quit thinking and look.
Step 0: reproduce your output
MKL=/opt/intel/oneapi/mkl/2024.1
g++ -pthread -g mkl.c -I ${MKL}/include -static -L ${MKL}/lib \
-lmkl_intel_lp64 -lmkl_intel_thread -liomp5 -lmkl_core -ldl \
$(for f in malloc free; do echo -Wl,--wrap,$f; done)
./a.out
...
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 1189 requesting 47!
In __wrap_malloc, cur offset: 1236 requesting 24!
In __wrap_free!
In __wrap_malloc, cur offset: 1260 requesting 51!
In __wrap_free!
...
In __wrap_malloc, cur offset: 777399 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
...
Well, I failed at that step :-(
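Possible causes: differences in the toolchain (I used g++ (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7) and GNU ld version 2.40-14.fc39). If I were able to reproduce your behavior, my next steps would be: run the program under GDB, set breakpoints on all allocation functions, and disable them. Set a breakpoint on cblas_sgemm. Once that breakpoint is hit, re-enable all the other breakpoints, and once one of them is hit, use (gdb) where to find out where the un-intercepted call comes from.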
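An in-process variant of the same idea, offered as an alternative rather than as the method described above: print the call stack from inside an allocation function. This sketch assumes glibc's execinfo.h; dump_callers and inDump are hypothetical names. backtrace_symbols_fd is used because, unlike backtrace_symbols, it does not call malloc to build its result, and the guard is there because backtrace itself may allocate on first use.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc)
#include <unistd.h>    // STDERR_FILENO

// Guard against re-entry in case capturing the stack triggers an allocation
// that lands back in the function calling dump_callers.
static thread_local bool inDump = false;

// Hypothetical helper: writes the current call stack to stderr and returns
// the number of frames captured (0 if called re-entrantly).
int dump_callers()
{
    if (inDump)
        return 0;
    inDump = true;
    void* frames[32];
    int count = backtrace(frames, 32);
    backtrace_symbols_fd(frames, count, STDERR_FILENO);
    inDump = false;
    return count;
}
```

In a statically linked executable the output may contain raw addresses only, in which case a tool such as addr2line can map them back to functions.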
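After that, to figure out why it was not intercepted, I would examine the calling function and look at its relocation records in the .o file, using readelf -Wr foo.o and the like.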