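I'm reading Michael Kerrisk's "Namespaces in operation" series (because I want to implement a container on Linux myself), and I found myself wondering about something:
In one of Michael's PID namespace examples, he wrote the following program: https://lwn.net/Articles/532745/
Here, the child's stack is a statically allocated buffer which, according to Michael's own comments, is: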
/* Space for child's stack */
/* Since each child gets a copy of virtual memory, this
buffer can be reused as each child creates its child */
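As I understand it, this is an attempt to take advantage of the copy-on-write (COW) mechanism in UNIX systems.
Now, there appears to be a bug in this code, which (according to Michael's own explanation) is: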
namespaces/multi_pidns.c
Allocate stacks for the child processes on the heap rather
than in static memory. Marcos Paulo de Souza pointed out
that the children were being killed by SIGSEGV after they
had completed the sleep() calls. (Some further investigation
showed that all children except the *last* are killed with
SIGSEGV.) It appears that they are killed after the child
start function returns. The problem goes away if the
children are allocated stacks in separate memory areas by
calling malloc() (which is the change made in this patch)
or in separate statically allocated buffers.
The reason that the children were killed is based on (my
misunderstanding of) the subtleties of the magic done
in the glibc clone() wrapper function. (See, for example,
the x86-64 implementation in the glibc source file
sysdeps/unix/sysv/linux/x86_64/clone.S.) The
previous code was relying on the fact that the parent's
memory was duplicated in the child during the clone() system
call, and the assumption that that duplicated memory could be
used in the child. However, before executing the clone()
system call, the clone() wrapper function saves some
information (that will be used by the child) onto the stack.
This happens in the address space of the parent, before the
memory is duplicated in the system call. Since the previous
code was making use of the same statically allocated buffer
(i.e., the same address as was used for the parent's stack)
for the child stack, the consequence was that the steps in
the clone() wrapper function were corrupting the stack of the
*parent* process, which ultimately resulted in (all but the
last of) the child processes crashing.
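To make the fix described in the commit message concrete, here is a minimal sketch of nested clone() calls where every child gets its own heap-allocated stack. It is not Michael's code: child_func, run_chain and STACK_SIZE are names I made up for illustration, and CLONE_NEWPID is left out on purpose so the sketch runs without privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)   /* assumed size, chosen for illustration */

/* Hypothetical helper: each process creates one child (until level reaches 0)
   and gives that child a separate, heap-allocated stack. */
static int child_func(void *arg)
{
    long level = (long)(intptr_t)arg;
    if (level <= 0)
        return 0;                  /* innermost process: nothing left to create */

    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;

    /* The stack grows downward on x86-64, so clone() gets the top of the block.
       This block never overlaps the stack the caller is currently running on. */
    pid_t pid = clone(child_func, stack + STACK_SIZE, SIGCHLD,
                      (void *)(intptr_t)(level - 1));
    if (pid == -1) {
        free(stack);
        return 1;
    }

    int status;
    int ok = waitpid(pid, &status, 0) != -1 &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    free(stack);
    return ok ? 0 : 1;             /* nonzero if any descendant failed or was killed */
}

/* Returns 0 if every process in a chain of `levels` nested children exited normally. */
static int run_chain(long levels)
{
    assert(levels >= 0);
    return child_func((void *)(intptr_t)levels);
}
```

Calling run_chain(3) from main() creates three nested children. A nonzero result means some process in the chain did not exit cleanly, for example because it was killed by a signal.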
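The fixed program is here: https://man7.org/tlpi/code/online/dist/namespaces/multi_pidns.c.html
I feel that my understanding of how the stack corruption happens is actually quite far off. If each child has its own copy of the allocated stack, and a child can write data without corrupting the parent's stack, why does this happen?
I couldn't find any other thorough explanation from Michael or on the internet.
I tried to find a thorough explanation of this from Michael, asked ChatGPT, and read about the clone() system call as well as about COW and stack allocation with a reused stack. I couldn't find a valid answer.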
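I haven't looked at the glibc code, but I think this is what it's describing:
The clone() function takes a pointer to the memory that should be used as the child's stack. It assumes that this memory is used only for that purpose, and that it will be treated as a stack only in the child, once the clone() system call has created it. So it uses that memory to save some temporary data while it works.
What happens is that when one of the children creates a new child of its own, its own stack is at the same memory location as that static buffer. So when clone() stores temporary data there, it overwrites part of the calling process's own stack.
What is confusing is the use of "parent" in that explanation. This isn't a problem for the original parent process, but the program creates a hierarchy of processes recursively, and the problem happens in those nested processes, because the static buffer overlaps with their stacks.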
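The premise of that explanation (a child created with the static buffer is itself running on that buffer) is easy to check. The sketch below clones only one level deep, so it does not reproduce the crash. It only shows that a local variable in the child lives inside the static buffer. The helper names are hypothetical, and CLONE_NEWPID is again omitted so no privileges are needed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)        /* assumed size, chosen for illustration */
static char child_stack[STACK_SIZE];    /* one static buffer, as in the original program */

/* Returns 1 if p points into child_stack, 0 otherwise. */
static int in_static_buffer(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)child_stack;
    return a >= lo && a < lo + STACK_SIZE;
}

/* Runs in the child: exit status 0 means "my stack is inside child_stack". */
static int report_stack_location(void *arg)
{
    (void)arg;
    int local = 0;                      /* lives on the child's current stack */
    return in_static_buffer(&local) ? 0 : 1;
}

/* Returns 1 if the child ran on the static buffer, 0 if not, -1 on error. */
static int child_runs_on_static_buffer(void)
{
    pid_t pid = clone(report_stack_location, child_stack + STACK_SIZE,
                      SIGCHLD, NULL);
    if (pid == -1)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

If that child then called clone() with child_stack + STACK_SIZE again, the wrapper would write its setup data at the top of the very stack the child is executing on, which is the overlap described above.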