I have the following docker-compose file:
version: '3.4'
services:
  serviceA:
    image: <image>
    command: <command>
    labels:
      servicename: "service-A"
    ports:
      - "8080:8080"
  serviceB:
    image: <image>
    command: <command>
    labels:
      servicename: "service-B"
    ports:
      - "8081:8081"
  prometheus:
    image: prom/prometheus:v2.32.1
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    expose:
      - 9090
    labels:
      org.label-schema.group: "monitoring"
volumes:
  prometheus_data: {}
The docker-compose setup also includes a Prometheus instance with the following configuration:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090', 'serviceA:8080', 'serviceB:8081']
ServiceA and ServiceB expose Prometheus metrics, each on its own port.
Everything works fine while there is a single instance of each service, but when I scale a service out and run multiple instances, Prometheus scraping starts mixing up the metrics and the collected data is corrupted.
I looked for a docker-compose service-discovery mechanism for this but couldn't find a suitable one. How can I solve this?
If it's really just docker-compose (i.e. not Swarm), you can use DNS service discovery (dns_sd_config) to obtain all the IPs belonging to a service:
# docker-compose.yml
version: "3"
services:
  prometheus:
    image: prom/prometheus
  test-service:  # <- this
    image: nginx
    deploy:
      replicas: 3
---
# prometheus.yml
scrape_configs:
  - job_name: test
    dns_sd_configs:
      - names:
          - test-service  # goes here
        type: A
        port: 80
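Adapted to the services from the question, a sketch might look like this (assuming the metric ports stay at 8080/8081; note that the host-port mappings like "8080:8080" have to go once you scale, since multiple replicas would collide on the same host port, and Prometheus reaches the containers over the compose network anyway):

```yaml
# prometheus.yml — sketch for the question's services, scaled e.g. with
# `docker compose up --scale serviceA=3` or deploy.replicas
scrape_configs:
  - job_name: serviceA
    dns_sd_configs:
      - names: ['serviceA']  # compose service name; resolves to every replica IP
        type: A
        port: 8080
  - job_name: serviceB
    dns_sd_configs:
      - names: ['serviceB']
        type: A
        port: 8081
```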
This is the simplest way to get up and running. Next, you can use the dedicated Docker service discovery: docker_sd_config. Besides the list of targets, it gives you more label data (container name, image version, and so on), but it also requires a connection to the Docker daemon to obtain that data. In my opinion, this is overkill for a development environment, but it can be essential in production. Here is an example configuration, boldly copy-pasted from https://github.com/prometheus/prometheus/blob/release-2.33/documentation/examples/prometheus-docker.yml :
# An example scrape configuration for running Prometheus with Docker.

scrape_configs:
  # Make Prometheus scrape itself for metrics.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Create a job for Docker daemon.
  #
  # This example requires Docker daemon to be configured to expose
  # Prometheus metrics, as documented here:
  # https://docs.docker.com/config/daemon/prometheus/
  - job_name: "docker"
    static_configs:
      - targets: ["localhost:9323"]

  # Create a job for Docker Swarm containers.
  #
  # This example works with cadvisor running using:
  # docker run --detach --name cadvisor -l prometheus-job=cadvisor
  #     --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock,ro
  #     --mount type=bind,src=/,dst=/rootfs,ro
  #     --mount type=bind,src=/var/run,dst=/var/run
  #     --mount type=bind,src=/sys,dst=/sys,ro
  #     --mount type=bind,src=/var/lib/docker,dst=/var/lib/docker,ro
  #     google/cadvisor -docker_only
  - job_name: "docker-containers"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock # You can also use http/https to connect to the Docker daemon.
    relabel_configs:
      # Only keep containers that have a `prometheus-job` label.
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: .+
        action: keep
      # Use the task labels that are prefixed by `prometheus-`.
      - regex: __meta_docker_container_label_prometheus_(.+)
        action: labelmap
        replacement: $1
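To illustrate what the two relabel rules above actually do, here is a small Python simulation of the same regex logic (this is not Prometheus code, just a sketch of the `keep` and `labelmap` semantics):

```python
import re
from typing import Optional

def relabel(labels: dict) -> Optional[dict]:
    """Simulate the two relabel_configs rules from the config above.

    1. action: keep — drop any target whose container has no
       `prometheus-job` label (the meta label must match `.+`).
    2. action: labelmap — copy every `__meta_docker_container_label_prometheus_*`
       meta label to a plain label named by the capture group ($1).
    """
    job = labels.get("__meta_docker_container_label_prometheus_job", "")
    if not re.fullmatch(r".+", job):
        return None  # target dropped by the keep rule

    out = dict(labels)
    for name, value in labels.items():
        m = re.fullmatch(r"__meta_docker_container_label_prometheus_(.+)", name)
        if m:
            out[m.group(1)] = value  # replacement: $1
    return out

# A container started with `-l prometheus-job=cadvisor` surfaces this meta label:
print(relabel({"__meta_docker_container_label_prometheus_job": "cadvisor"}))
print(relabel({"unrelated": "x"}))  # None — no prometheus-job label, dropped
```

(In real Prometheus the `__meta_*` labels are discarded after relabeling; the sketch keeps them for visibility.)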
Finally, there is dockerswarm_sd_config, which, obviously, is meant to be used with Docker Swarm. It is the most complex of the three, and accordingly there is a comprehensive official setup guide for it. Like docker_sd_config, it carries additional information about the containers in its labels, and even more (for example, it can tell you which node a container is on). An example config is available here: https://github.com/prometheus/prometheus/blob/release-2.33/documentation/examples/prometheus-dockerswarm.yml, but you should really read the documentation to understand it and tune it for your own setup.
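As a taste, here is a minimal sketch (mine, not from the guide) of where such a config starts; `role` selects what gets discovered:

```yaml
# prometheus.yml — minimal dockerswarm_sd_config sketch
scrape_configs:
  - job_name: swarm-tasks
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks  # one target per task/replica; also: services, nodes
```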
Adding a unique label to each pod, container, or worker (as @anemyte suggests) is a valid solution, but it can lead to cardinality explosion. Essentially, every new instance creates a new counter (time series). Depending on your retention period and how often new instances are spawned, this can slow your Prometheus down significantly.
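A back-of-the-envelope illustration of that effect (all the numbers here are made up, except the 200h retention from the question's config):

```python
# Rough, illustrative arithmetic: every distinct (metric, label-set) pair
# is its own time series, so per-instance labels multiply the series count.
metrics_per_service = 50    # hypothetical metric count one service exposes
live_replicas = 10          # replicas running at any given moment
replaced_per_day = 20       # replicas churned per day, each with a fresh label
retention_days = 200 / 24   # the question's --storage.tsdb.retention.time=200h

# Without a per-instance label the series count stays flat:
series_static = metrics_per_service

# With a unique label per replica, every replica seen during the retention
# window keeps its series alive in the TSDB:
series_per_instance = metrics_per_service * (
    live_replicas + replaced_per_day * retention_days
)

print(series_static)                 # 50
print(round(series_per_instance))    # 8833 — nearly 180x more series
```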
Moreover, if you are not actually interested in grouping by that extra label, then to me this is not the right solution to the problem but a workaround that may cause hard-to-debug issues down the road. The solution that has worked well for me is to use a bridge, such as StatsD, which receives the raw signals and acts as a central aggregation point and scrape target for Prometheus. If you are interested, I wrote about this in the context of multi-threaded web servers:
https://mkaz.me/blog/2023/collecting-metrics-from-multi-process-web-servers-the-ruby-case/
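To make the bridge idea concrete: each replica fires raw events at a single StatsD(-exporter) address over UDP, and Prometheus scrapes only that one aggregator. A minimal sketch (the host, port, and metric name are made-up assumptions):

```python
import socket

def statsd_increment(name: str, value: int = 1,
                     host: str = "statsd-exporter", port: int = 9125) -> bytes:
    """Send a StatsD counter increment over UDP and return the wire payload.

    The StatsD line protocol for a counter is "<name>:<value>|c".
    """
    payload = f"{name}:{value}|c".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))  # fire-and-forget; UDP never blocks
    except OSError:
        pass  # no aggregator reachable — fine for this sketch
    finally:
        sock.close()
    return payload

# Every replica calls this; aggregation happens in one place, so Prometheus
# sees a single target and no per-instance label explosion.
print(statsd_increment("http_requests_total"))  # b'http_requests_total:1|c'
```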