Per-process memory usage with sar, sysstat

Question

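Can I get memory usage per process on Linux? We monitor our servers with sysstat/sar. But apart from seeing that memory went through the roof at some point, we can't pinpoint which process was getting bigger and bigger. Is there a way with sar (or other tools) to record memory usage per process and look at it later on?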

Answer 1

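`sysstat` includes `pidstat`, whose man page says:

The `pidstat` command is used for monitoring individual tasks currently being managed by the Linux kernel. It writes to standard output activities for every task selected with option `-p` or for every task managed by the Linux kernel [...]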

Linux kernel tasks include user-space processes and threads (and also kernel threads, which are of least interest here).

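Unfortunately, `sysstat` does not support collecting historical data from `pidstat`, and its author does not seem interested in adding such support (per the GitHub issues about `pidstat`).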

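That being said, the tabular output of `pidstat` can be written to a file and parsed later. Usually a group of processes is of interest rather than every process on the system. I'll focus on one process and its child processes.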

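What could serve as an example? Firefox. `pgrep firefox` returns its PID, and `$(pgrep -d, -P $(pgrep firefox))` returns a comma-separated list of its children. Given this, the `pidstat` command can look like this: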

LC_NUMERIC=C.UTF-8 watch pidstat -dru -hl \
    -p '$(pgrep firefox),$(pgrep -d, -P $(pgrep firefox))' \
    10 60 '>>' firefox-$(date +%s).pidstat

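Some observations:

- `LC_NUMERIC` is set to make `pidstat` use a dot as the decimal separator.
- `watch` is used to repeat `pidstat` every 600 seconds (60 samples at 10-second intervals), in case the process subtree changes.
- `-d` reports I/O statistics, `-r` reports page faults and memory utilization, and `-u` reports CPU utilization.
- `-h` places all report groups on one line, and `-l` displays the process command name with all its arguments (well, sort of, because it still trims it to 127 characters).
- `date` is used to avoid accidentally overwriting an existing file.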
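If `watch` is not convenient, the same invocation can be assembled from a script. The following is only a sketch: `build_pidstat_cmd` and `collect` are hypothetical helpers, not part of sysstat. They reuse the flags listed above and assume the caller has already looked up the PIDs (for example with `pgrep`).

```python
import os
import subprocess


def build_pidstat_cmd(pids, interval=10, count=60):
    """Return the argument list for one pidstat run over the given PIDs.

    Hypothetical helper: it only assembles the flags discussed above.
    """
    if not pids:
        raise ValueError("at least one PID is required")
    pid_list = ",".join(str(p) for p in pids)
    return ["pidstat", "-dru", "-hl", "-p", pid_list, str(interval), str(count)]


def collect(pids, outfile, interval=10, count=60):
    """Run pidstat once and append its table to outfile (needs sysstat installed)."""
    # LC_NUMERIC makes pidstat print a dot as the decimal separator
    env = dict(os.environ, LC_NUMERIC="C.UTF-8")
    with open(outfile, "a") as f:
        subprocess.run(
            build_pidstat_cmd(pids, interval, count), stdout=f, env=env, check=True
        )
```

For instance, `collect([5173, 5344], 'firefox.pidstat')` would append one ten-minute run for those two PIDs; calling it in a loop with a fresh PID list each time plays the role of `watch`.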

It produces something like this:

Linux kernel version (host)     31/03/20    _x86_64_    (8 CPU)

#      Time   UID       PID    %usr %system  %guest    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
 1585671289  1000      5173    0.50    0.30    0.00    0.80     5      0.70      0.00 3789880  509536   3.21      0.00     29.60      0.00       0  /usr/lib/firefox/firefox 
 1585671289  1000      5344    0.70    0.30    0.00    1.00     1      0.50      0.00 3914852  868596   5.48      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -childID 1 ...
 1585671289  1000      5764    0.10    0.10    0.00    0.20     1      7.50      0.00 9374676  363984   2.29      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -childID 2 ...
 1585671289  1000      5852    6.60    0.90    0.00    7.50     7    860.70      0.00 4276640 1040568   6.56      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -childID 3 ...
 1585671289  1000     24556    0.00    0.00    0.00    0.00     7      0.00      0.00  419252   18520   0.12      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -parentBuildID ...

#      Time   UID       PID    %usr %system  %guest    %CPU   CPU  minflt/s  majflt/s     VSZ     RSS   %MEM   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
 1585671299  1000      5173    3.40    1.60    0.00    5.00     6      7.60      0.00 3789880  509768   3.21      0.00     20.00      0.00       0  /usr/lib/firefox/firefox 
 1585671299  1000      5344    5.70    1.30    0.00    7.00     6    410.10      0.00 3914852  869396   5.48      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -childID 1 ...
 1585671299  1000      5764    0.00    0.00    0.00    0.00     3      0.00      0.00 9374676  363984   2.29      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -childID 2 ...
 1585671299  1000      5852    1.00    0.30    0.00    1.30     1     90.20      0.00 4276640 1040452   6.56      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -childID 3 ...
 1585671299  1000     24556    0.00    0.00    0.00    0.00     7      0.00      0.00  419252   18520   0.12      0.00      0.00      0.00       0  /usr/lib/firefox/firefox -contentproc -parentBuildID ...

...

Note that each line of data starts with a space, so parsing is easy:

import pandas as pd

def read_columns(filename):
    """Return the column names from the first header line (it starts with '#')."""
    with open(filename) as f:
        for l in f:
            if l[0] == '#':
                return l.strip('#').split()
    raise LookupError('no header line found')

def get_lines(filename, colnum):
    """Yield the fields of each data line; data lines start with a space."""
    with open(filename) as f:
        for l in f:
            if l[0] == ' ':
                # Limit the number of splits so Command keeps its embedded spaces
                yield l.rstrip().split(maxsplit=colnum - 1)

filename = '/path/to/firefox.pidstat'
columns = read_columns(filename)
exclude = 'CPU', 'UID'
df = pd.DataFrame.from_records(
    get_lines(filename, len(columns)), columns=columns, exclude=exclude
)
# Every column except Command is numeric
numcols = df.columns.drop('Command')
df[numcols] = df[numcols].apply(pd.to_numeric, errors='coerce')
df['RSS'] = df.RSS / 1024  # Make MiB
df['Time'] = pd.to_datetime(df['Time'], unit='s', utc=True)
df = df.set_index('Time')
df.info()

The structure of the dataframe is as follows:

Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   PID        6155 non-null   int64  
 1   %usr       6155 non-null   float64
 2   %system    6155 non-null   float64
 3   %guest     6155 non-null   float64
 4   %CPU       6155 non-null   float64
 5   minflt/s   6155 non-null   float64
 6   majflt/s   6155 non-null   float64
 7   VSZ        6155 non-null   int64  
 8   RSS        6155 non-null   float64
 9   %MEM       6155 non-null   float64
 10  kB_rd/s    6155 non-null   float64
 11  kB_wr/s    6155 non-null   float64
 12  kB_ccwr/s  6155 non-null   float64
 13  iodelay    6155 non-null   int64  
 14  Command    6155 non-null   object 
dtypes: float64(11), int64(3), object(1)
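To get at the original question (which process kept growing), the same file can also be scanned without pandas. This is a minimal sketch using only the standard library; `rss_growth` and `SAMPLE` are illustrative names, and the sample rows are copied from the output above with the whitespace compressed.

```python
SAMPLE = """\
# Time UID PID %usr %system %guest %CPU CPU minflt/s majflt/s VSZ RSS %MEM kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
 1585671289 1000 5173 0.50 0.30 0.00 0.80 5 0.70 0.00 3789880 509536 3.21 0.00 29.60 0.00 0 /usr/lib/firefox/firefox
 1585671289 1000 5344 0.70 0.30 0.00 1.00 1 0.50 0.00 3914852 868596 5.48 0.00 0.00 0.00 0 /usr/lib/firefox/firefox -contentproc -childID 1 ...
 1585671299 1000 5173 3.40 1.60 0.00 5.00 6 7.60 0.00 3789880 509768 3.21 0.00 20.00 0.00 0 /usr/lib/firefox/firefox
 1585671299 1000 5344 5.70 1.30 0.00 7.00 6 410.10 0.00 3914852 869396 5.48 0.00 0.00 0.00 0 /usr/lib/firefox/firefox -contentproc -childID 1 ...
"""


def rss_growth(lines):
    """Map each PID to (first RSS, last RSS, growth) in kB from `pidstat -h` lines."""
    columns = None
    first, last = {}, {}
    for line in lines:
        if line.startswith("#"):
            # Header line: remember the column names
            columns = line.lstrip("#").split()
            continue
        if columns is None or not line.startswith(" "):
            continue  # banner, blank line, or data before any header
        fields = line.split(maxsplit=len(columns) - 1)
        if len(fields) < len(columns):
            continue  # incomplete row
        row = dict(zip(columns, fields))
        pid, rss = int(row["PID"]), int(row["RSS"])
        first.setdefault(pid, rss)  # keep the earliest sample
        last[pid] = rss             # overwrite with the latest sample
    return {pid: (first[pid], last[pid], last[pid] - first[pid]) for pid in first}


growth = rss_growth(SAMPLE.splitlines())
# Largest growth first
ranked = sorted(growth.items(), key=lambda kv: kv[1][2], reverse=True)
for pid, (start, end, delta) in ranked:
    print(pid, start, end, delta)
```

On a real capture, pass the open file object instead of `SAMPLE.splitlines()`. Comparing only the first and last sample is a deliberate simplification; a process that spikes and shrinks again would need the per-timestamp data from the dataframe.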

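It can be visualised in many ways, depending on what the monitoring is focused on, but `%CPU` and `RSS` are the most common metrics to look at. Here is an example.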

import matplotlib.pyplot as plt

pids = df.PID.unique()
# squeeze=False keeps axes two-dimensional even if only one PID was recorded
fig, axes = plt.subplots(len(pids), 2, figsize=(12, 8), squeeze=False)
x_range = [df.index.min(), df.index.max()]
for i, pid in enumerate(pids):
    subdf = df[df.PID == pid]
    # Title shows the PID and how long it was observed
    title = ', '.join([f'PID {pid}', str(subdf.index.max() - subdf.index.min())])
    for j, col in enumerate(('%CPU', 'RSS')):
        ax = subdf.plot(
            y=col, title=title if j == 0 else None, ax=axes[i][j], sharex=True
        )
        ax.legend(loc='upper right')
        ax.set_xlim(x_range)

plt.tight_layout()
plt.show()

It produces a figure with one row of plots per PID, `%CPU` on the left and `RSS` on the right.

Answer 2

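This is purely a matter of preference, but I would keep it nice and simple until you know what you are looking for. I'd create a cron job that first prints your memory, disk and CPU usage, followed by the top culprits by memory.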

#!/bin/sh
free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'
df -h | awk '$NF=="/"{printf "Disk Usage: %d/%dGB (%s)\n", $3,$2,$5}'
top -bn1 | grep load | awk '{printf "CPU Load: %.2f\n", $(NF-2)}' 
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head

Once you've found the culprit, you can home in further and drill into the details.

Answer 3

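I agree 100% with Jason Birchall about hunting down the culprit before drilling down. I noticed that `ps -eo` truncates `cmd` when other columns follow it. Putting `cmd` last keeps the full, possibly very long, command line: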

ps -eo pid,ppid,%mem,%cpu,cmd --sort=-%mem | head

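I modified his script to run as leakage.sh in my home directory (I use tmux to "background" it) and write its output to a file. Now I can watch it live and still see the output after a reboot. I'm still tracking down my problem, so I may edit this as I improve things.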

while [ true ]; do ./leakage.sh > leakage; cat leakage; sleep 5; done
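Note that the loop above truncates `leakage` on every pass (`>`), so only the most recent snapshot survives. If a history is wanted, one option is to append each snapshot under a timestamp. This is a sketch only: `append_snapshot`, `run_forever` and the `leakage.log` file name are made up for illustration.

```python
import subprocess
import time
from datetime import datetime, timezone


def append_snapshot(path, text, now=None):
    """Append text to path under a UTC timestamp header, keeping earlier snapshots."""
    now = now or datetime.now(timezone.utc)
    with open(path, "a") as f:
        f.write(f"=== {now.isoformat(timespec='seconds')} ===\n")
        f.write(text if text.endswith("\n") else text + "\n")


def run_forever(script="./leakage.sh", path="leakage.log", delay=5):
    """Run the script every `delay` seconds and append its output to `path`."""
    while True:
        result = subprocess.run([script], capture_output=True, text=True)
        append_snapshot(path, result.stdout)
        time.sleep(delay)
```

Calling `run_forever()` inside tmux would replace the shell loop; the log grows without bound, so rotating or pruning it is left to the reader.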
