Using a custom malloc implementation with MKL

Question

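I am writing a program that uses Intel's MKL for some matrix multiplications. I have the frustrating requirement of using only a custom implementation of dynamic memory allocation. I know this is generally considered a bad idea, but I use the linker's --wrap feature to wrap malloc and free with my own custom implementations. Overall, this has gone well so far.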
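
For readers unfamiliar with the mechanism: with GNU ld, --wrap=malloc resolves undefined references to malloc to __wrap_malloc, and references to __real_malloc to the original malloc. The sketch below shows a pass-through wrapper in that shape. The USE_LD_WRAP macro, underlying_malloc, and g_wrapCalls are hypothetical names introduced only so the snippet also links without the flag; they are not part of the demo program.

```cpp
#include <cstddef>
#include <cstdlib>

#ifdef USE_LD_WRAP
// Built with -Wl,--wrap=malloc: ld resolves __real_malloc to the original
// malloc. It is a function, so it is declared as one (not as a pointer).
extern "C" void* __real_malloc(std::size_t);
static void* underlying_malloc(std::size_t n) { return __real_malloc(n); }
#else
// Fallback for builds WITHOUT --wrap (assumption for illustration only).
// Do not combine this branch with --wrap=malloc: std::malloc would then
// resolve back to __wrap_malloc and recurse forever.
static void* underlying_malloc(std::size_t n) { return std::malloc(n); }
#endif

// Hypothetical counter showing how often the wrapper was entered.
static std::size_t g_wrapCalls = 0;

// With --wrap=malloc, undefined references to malloc in the linked objects
// are redirected to this function.
extern "C" void* __wrap_malloc(std::size_t numBytes)
{
    ++g_wrapCalls;
    return underlying_malloc(numBytes);
}
```

Built with g++ -DUSE_LD_WRAP -Wl,--wrap=malloc, every malloc reference that the static link resolves lands in __wrap_malloc. References the static link never sees, for example calls inside an already-linked shared library or addresses obtained at run time through dlsym, are not rewritten.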

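However, some MKL code appears to perform dynamic allocation without calling my custom malloc. I know that MKL also replaces the system malloc with its own custom allocator, but my program calls mkl_disable_fast_mm(), which, as I understand it, should turn off MKL's custom allocator and revert to the system malloc. Since I have already wrapped the system malloc with my custom one using --wrap, I expect to see my custom malloc called whenever MKL allocates dynamically.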

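When I run the program normally (as described above), I can see my custom malloc being called everywhere malloc is used, except for calls from inside MKL.

To add another layer of complexity, if I run the program under valgrind, I do see my custom malloc called everywhere, including from inside MKL. I realize that valgrind also replaces malloc with its own version, so several levels of malloc replacement are in play in that case.

My question is: how can I get MKL to call my custom malloc when it allocates dynamically? It seems this must be possible, since it happens under valgrind, but I cannot find a way to make it happen without valgrind.

I put together a very simple example that demonstrates what I described above: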

#include <stdio.h>
#include <stdlib.h>

#include "mkl.h"

//With --wrap, the linker exposes the original functions as __real_malloc
//and __real_free. They are functions, not function pointers, so they are
//declared as such (unused in this minimal demo).
extern "C" void* __real_malloc(size_t);
extern "C" void __real_free(void*);

extern "C" void* __wrap_malloc(size_t numBytes)
{
  //Yes, this is a dumb way to do to this, keeping it minimal for demo
  static char heapSpace[10000000] = {0};
  static size_t heapOff = 0;

  //%zu is the correct conversion for size_t (was %ld / %d)
  fprintf(stderr, "In __wrap_malloc, cur offset: %zu requesting %zu!\n", heapOff, numBytes);

  void* heapLoc = heapSpace + heapOff;
  heapOff += numBytes;

  return heapLoc;
}

extern "C" void __wrap_free(void* ptrToFree)
{
  fprintf(stderr, "In __wrap_free!\n");
  //just a no-op for the minimal demo
}

int main()
{
  fprintf(stderr, "Disabling fast memory management for MKL in order to use system malloc and free instead\n");
  int disableFastMMReturnVal = mkl_disable_fast_mm();
  fprintf(stderr, "  --> Reports value of %d (1 should mean MKL memory management turned off successfully)\n", disableFastMMReturnVal);

  //Use malloc to allocate a small array of chars
  char* tmpPtr;
  tmpPtr = (char*)malloc(4 * sizeof(char));
  tmpPtr[0] = 'f'; tmpPtr[1] = 'o'; tmpPtr[2] = 'o'; tmpPtr[3] = '\0';

  //Use malloc to allocate another small array of chars
  char* diffPtr;
  diffPtr = (char*)malloc(3 * sizeof(char));
  diffPtr[0] = 'h'; diffPtr[1] = 'i'; diffPtr[2] = '\0';

  //See that data is as expected
  fprintf(stderr, "TEMPPTR: %s DIFFPTR: %s\n", tmpPtr, diffPtr);
  //Just a no-op for this demo, but see that the wrapped free gets called
  free(diffPtr);
  free(tmpPtr);

  //Now, set up a MKL matrix multiply call:
  const int M = 128;
  const int K = 128;
  const int N = 128;
  const float alpha = 1.0;
  const float beta = 0.0;

  float A[M * K];
  float B[K * N];
  float C[M * N];

  //Initialize the input matrices to known values
  for (int r = 0; r < M; r++)
    for (int c = 0; c < K; c++)
      A[r * K + c] = r * c;

  for (int r = 0; r < K; r++)
    for (int c = 0; c < N; c++)
      B[r * N + c] = r + c;

  fprintf(stderr, "START CALL TO cblas_sgemm\n");
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, alpha, A, K, B, N, beta, C, N);
  fprintf(stderr, "FINISHED CALL TO cblas_sgemm\n");

  //Print some values from the output to check consistent results from run to run
  fprintf(stderr, "[0][0]: %f\n", C[0 * N + 0]);
  fprintf(stderr, "[20][20]: %f\n", C[20 * N + 20]);
  fprintf(stderr, "[40][40]: %f\n", C[40 * N + 40]);
  fprintf(stderr, "[100][100]: %f\n", C[100 * N + 100]);

  return 0;
}
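
As a side note on the demo allocator: it hands out addresses at arbitrary byte offsets (the first MKL request in the valgrind run below starts at offset 7) and never checks for exhaustion. A minimal sketch of a bump allocator that rounds every request up to an alignment boundary and returns nullptr when full could look like the following. The names bumpAlloc, g_arena, kArenaSize, and kAlign are hypothetical, and the 64-byte alignment is an assumption chosen for illustration, not something the demo requires.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size arena; names and sizes are illustrative only.
constexpr std::size_t kArenaSize = 1 << 20;
constexpr std::size_t kAlign = 64;  // assumption: cache-line alignment
static_assert((kAlign & (kAlign - 1)) == 0, "kAlign must be a power of two");
static_assert(kArenaSize % kAlign == 0, "arena must be a multiple of kAlign");

alignas(kAlign) static unsigned char g_arena[kArenaSize];
static std::size_t g_arenaOff = 0;

// Round n up to the next multiple of kAlign.
static std::size_t roundUp(std::size_t n)
{
    return (n + kAlign - 1) & ~(kAlign - 1);
}

// Returns a kAlign-aligned block, or nullptr when the arena is exhausted.
void* bumpAlloc(std::size_t numBytes)
{
    if (numBytes > kArenaSize) return nullptr;  // also guards roundUp overflow
    const std::size_t need = roundUp(numBytes == 0 ? 1 : numBytes);
    if (need > kArenaSize - g_arenaOff) return nullptr;
    void* p = g_arena + g_arenaOff;
    g_arenaOff += need;
    return p;
}
```

A __wrap_malloc built on this would simply return bumpAlloc(numBytes); the offsets it prints would then advance in multiples of 64 rather than by the raw request size.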

Here is the output when I run without valgrind:

# ./demo.exe 
Disabling fast memory management for MKL in order to use system malloc and free instead
  --> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000

Here is the output when I run with valgrind:

# valgrind ./demo.exe 
==487722== Memcheck, a memory error detector
==487722== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==487722== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==487722== Command: ./demo.exe
==487722== 
Disabling fast memory management for MKL in order to use system malloc and free instead
  --> Reports value of 1 (1 should mean MKL memory management turned off successfully)
In __wrap_malloc, cur offset: 0 requesting 4!
In __wrap_malloc, cur offset: 4 requesting 3!
TEMPPTR: foo DIFFPTR: hi
In __wrap_free!
In __wrap_free!
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 7 requesting 4344376!
In __wrap_malloc, cur offset: 4344383 requesting 69664!
In __wrap_malloc, cur offset: 4414047 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
==487722== 
==487722== HEAP SUMMARY:
==487722==     in use at exit: 0 bytes in 0 blocks
==487722==   total heap usage: 1 allocs, 1 frees, 72,704 bytes allocated
==487722== 
==487722== All heap blocks were freed -- no leaks are possible
==487722== 
==487722== For lists of detected and suppressed errors, rerun with: -s
==487722== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Note that without valgrind, __wrap_malloc is not called during the call to cblas_sgemm, but with valgrind it is called 3 times.

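Edit #2: As @Andrew Henle suggested, other allocation functions may be in play, so I also added wrappers for calloc and realloc. With these two new wrappers added, the results are exactly the same as shown above. Next, I ran nm on the executable (everything is statically linked) and got the following: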

nm -C demo.exe |& grep alloc
00000000004021f0 T cblas_xerbla_malloc_error
0000000000c7b008 D i_calloc
0000000000c7b000 D i_malloc
0000000000c7b010 D i_realloc
0000000001f94d40 b mkl_hbw_malloc_psize
00000000004438f0 T mkl_serv_allocate
000000000044a550 T mkl_serv_calloc
0000000000446030 T mkl_serv_deallocate
000000000044a660 T mkl_serv_jit_alloc
0000000000444b10 T mkl_serv_malloc
0000000000449860 T mkl_serv_realloc
0000000000443710 t mm_internal_malloc
0000000000442df0 t mm_internal_realloc
0000000001f94d50 b sys_alloc
0000000001f94d68 b sys_allocate
0000000001f94d70 b sys_deallocate
0000000001f94d58 b sys_realloc
0000000000401442 T __wrap_calloc
00000000004013e6 T __wrap_malloc
00000000004014b2 T __wrap_realloc
0000000001f8dd80 b __wrap_calloc::heapOff
0000000001604700 b __wrap_calloc::heapSpace
00000000016046e0 b __wrap_malloc::heapOff
0000000000c7b060 b __wrap_malloc::heapSpace
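
The calloc and realloc wrappers from Edit #2 are not shown in the question. As a rough sketch of how such wrappers might look, the version below keeps a size header in front of each block so that realloc knows how many bytes to copy. It is not the asker's code: arenaAlloc, blockSize, g_heap, and the constants are hypothetical. It also assumes that all three wrappers share one arena, whereas the listing above shows separate __wrap_malloc::heapSpace and __wrap_calloc::heapSpace buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace {
constexpr std::size_t kHeapSize = 1 << 20;
constexpr std::size_t kHeader = 16;  // header size; keeps payloads 16-byte aligned
alignas(16) unsigned char g_heap[kHeapSize];
std::size_t g_heapOff = 0;

// Bump-allocate n bytes and record n in a header just before the payload.
void* arenaAlloc(std::size_t n)
{
    if (n > kHeapSize - kHeader) return nullptr;
    const std::size_t need = kHeader + ((n + 15) & ~static_cast<std::size_t>(15));
    if (need > kHeapSize - g_heapOff) return nullptr;
    unsigned char* base = g_heap + g_heapOff;
    std::memcpy(base, &n, sizeof n);
    g_heapOff += need;
    return base + kHeader;
}

// Read back the payload size stored by arenaAlloc.
std::size_t blockSize(const void* p)
{
    std::size_t n;
    std::memcpy(&n, static_cast<const unsigned char*>(p) - kHeader, sizeof n);
    return n;
}
}  // namespace

extern "C" void* __wrap_malloc(std::size_t numBytes)
{
    return arenaAlloc(numBytes);
}

extern "C" void* __wrap_calloc(std::size_t count, std::size_t size)
{
    if (size != 0 && count > SIZE_MAX / size) return nullptr;  // overflow check
    void* p = arenaAlloc(count * size);
    if (p) std::memset(p, 0, count * size);  // calloc must return zeroed memory
    return p;
}

extern "C" void* __wrap_realloc(void* old, std::size_t newSize)
{
    void* p = arenaAlloc(newSize);
    if (p && old) {
        const std::size_t oldSize = blockSize(old);
        std::memcpy(p, old, oldSize < newSize ? oldSize : newSize);
    }
    return p;  // the old block is abandoned, matching the no-op free of the demo
}
```

The header only works for pointers that came from this arena, which is why the sketch routes malloc through the same arenaAlloc.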

Answer 1

From David Agans' book on the 9 rules of debugging: quit thinking and look.

Step 0: reproduce your output

MKL=/opt/intel/oneapi/mkl/2024.1
g++ -pthread -g mkl.c -I ${MKL}/include -static -L ${MKL}/lib \
  -lmkl_intel_lp64 -lmkl_intel_thread -liomp5 -lmkl_core -ldl \
  $(for f in malloc free; do echo -Wl,--wrap,$f; done)

./a.out
...
START CALL TO cblas_sgemm
In __wrap_malloc, cur offset: 1189 requesting 47!
In __wrap_malloc, cur offset: 1236 requesting 24!
In __wrap_free!
In __wrap_malloc, cur offset: 1260 requesting 51!
In __wrap_free!
...
In __wrap_malloc, cur offset: 777399 requesting 256!
In __wrap_free!
FINISHED CALL TO cblas_sgemm
[0][0]: 0.000000
[20][20]: 17068800.000000
[40][40]: 40640000.000000
[100][100]: 150367936.000000
...

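Well, I failed at that step :-( In my build, __wrap_malloc is called from inside cblas_sgemm.

Possible reasons:

- You didn't say exactly how you built the test, and that matters.
- You are using a different version of MKL.
- You are using a different version of GCC / binutils (I used g++ (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7) and GNU ld version 2.40-14.fc39).
- Something else.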

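If I could reproduce your behavior, my next step would be: run the program under GDB, set breakpoints on all the allocation functions, and disable them. Set a breakpoint on cblas_sgemm. Once that breakpoint is hit, re-enable all the other breakpoints, and once one of them is hit, use (gdb) where to find out where the un-intercepted call comes from.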

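After that, to figure out why it is not intercepted, I would examine the calling function and look at its relocation records in the .o file, using readelf -Wr foo.o and the like.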
