I have the following docker-compose file:
version: '3.4'
services:
  serviceA:
    image: <image>
    command: <command>
    labels:
      servicename: "service-A"
    ports:
      - "8080:8080"
  serviceB:
    image: <image>
    command: <command>
    labels:
      servicename: "service-B"
    ports:
      - "8081:8081"
  prometheus:
    image: prom/prometheus:v2.32.1
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    expose:
      - 9090
    labels:
      org.label-schema.group: "monitoring"
volumes:
  prometheus_data: {}
The docker-compose setup also includes a Prometheus instance with the following configuration:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090', 'serviceA:8080', 'serviceB:8081']
ServiceA and ServiceB expose Prometheus metrics, each on its own port.
Everything works fine while there is a single instance of each service, but when I scale a service out and run multiple instances, Prometheus scraping starts mixing up the metrics and the collected data is corrupted.
I looked for a docker-compose service-discovery mechanism for this but couldn't find a suitable one. How can I solve this?
If it's really just docker-compose (i.e. not Swarm), you can use DNS service discovery (dns_sd_config) to obtain all the IPs belonging to a service:
# docker-compose.yml
version: "3"
services:
  prometheus:
    image: prom/prometheus
  test-service:  # <- this
    image: nginx
    deploy:
      replicas: 3
---
# prometheus.yml
scrape_configs:
  - job_name: test
    dns_sd_configs:
      - names:
          - test-service  # goes here
        type: A
        port: 80
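Adapted to the services from the question, a sketch might look like this (assuming the metric ports stay at 8080/8081; note that the host-port mappings like "8080:8080" have to go once you scale, since multiple replicas would collide on the same host port, and Prometheus reaches the containers over the compose network anyway):

```yaml
# prometheus.yml — sketch for the question's services, scaled e.g. with
# `docker compose up --scale serviceA=3` or deploy.replicas
scrape_configs:
  - job_name: serviceA
    dns_sd_configs:
      - names: ['serviceA']  # compose service name; resolves to every replica IP
        type: A
        port: 8080
  - job_name: serviceB
    dns_sd_configs:
      - names: ['serviceB']
        type: A
        port: 8081
```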
This is the simplest way to get up and running. Next, you can use the dedicated Docker service discovery: docker_sd_config. Besides the list of targets, it gives you more label data (container name, image version, and so on), but it also requires a connection to the Docker daemon to obtain that data. In my opinion, this is overkill for a development environment, but it can be essential in production. Here is an example configuration, boldly copy-pasted from https://github.com/prometheus/prometheus/blob/release-2.33/documentation/examples/prometheus-docker.yml :
# An example scrape configuration for running Prometheus with Docker.

scrape_configs:
  # Make Prometheus scrape itself for metrics.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Create a job for Docker daemon.
  #
  # This example requires Docker daemon to be configured to expose
  # Prometheus metrics, as documented here:
  # https://docs.docker.com/config/daemon/prometheus/
  - job_name: "docker"
    static_configs:
      - targets: ["localhost:9323"]

  # Create a job for Docker Swarm containers.
  #
  # This example works with cadvisor running using:
  # docker run --detach --name cadvisor -l prometheus-job=cadvisor
  #     --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock,ro
  #     --mount type=bind,src=/,dst=/rootfs,ro
  #     --mount type=bind,src=/var/run,dst=/var/run
  #     --mount type=bind,src=/sys,dst=/sys,ro
  #     --mount type=bind,src=/var/lib/docker,dst=/var/lib/docker,ro
  #     google/cadvisor -docker_only
  - job_name: "docker-containers"
    docker_sd_configs:
      - host: unix:///var/run/docker.sock # You can also use http/https to connect to the Docker daemon.
    relabel_configs:
      # Only keep containers that have a `prometheus-job` label.
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: .+
        action: keep
      # Use the task labels that are prefixed by `prometheus-`.
      - regex: __meta_docker_container_label_prometheus_(.+)
        action: labelmap
        replacement: $1
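To illustrate what the two relabel rules above actually do, here is a small Python simulation of the same regex logic (this is not Prometheus code, just a sketch of the `keep` and `labelmap` semantics):

```python
import re
from typing import Optional

def relabel(labels: dict) -> Optional[dict]:
    """Simulate the two relabel_configs rules from the config above.

    1. action: keep — drop any target whose container has no
       `prometheus-job` label (the meta label must match `.+`).
    2. action: labelmap — copy every `__meta_docker_container_label_prometheus_*`
       meta label to a plain label named by the capture group ($1).
    """
    job = labels.get("__meta_docker_container_label_prometheus_job", "")
    if not re.fullmatch(r".+", job):
        return None  # target dropped by the keep rule

    out = dict(labels)
    for name, value in labels.items():
        m = re.fullmatch(r"__meta_docker_container_label_prometheus_(.+)", name)
        if m:
            out[m.group(1)] = value  # replacement: $1
    return out

# A container started with `-l prometheus-job=cadvisor` surfaces this meta label:
print(relabel({"__meta_docker_container_label_prometheus_job": "cadvisor"}))
print(relabel({"unrelated": "x"}))  # None — no prometheus-job label, dropped
```

(In real Prometheus the `__meta_*` labels are discarded after relabeling; the sketch keeps them for visibility.)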
Finally, there is dockerswarm_sd_config, which, obviously, is meant to be used with Docker Swarm. It is the most complex of the three, and accordingly there is a comprehensive official setup guide for it. Like docker_sd_config, it carries additional information about the containers in its labels, and even more (for example, it can tell you which node a container is on). An example config is available here: https://github.com/prometheus/prometheus/blob/release-2.33/documentation/examples/prometheus-dockerswarm.yml, but you should really read the documentation to understand it and tune it for your own setup.
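As a taste, here is a minimal sketch (mine, not from the guide) of where such a config starts; `role` selects what gets discovered:

```yaml
# prometheus.yml — minimal dockerswarm_sd_config sketch
scrape_configs:
  - job_name: swarm-tasks
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks  # one target per task/replica; also: services, nodes
```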
Adding a unique label to each pod, container, or worker (as @anemyte suggests) is a valid solution, but it can lead to cardinality explosion. Essentially, every new instance creates a new counter (time series). Depending on your retention period and how often new instances are spawned, this can slow your Prometheus down significantly.
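A back-of-the-envelope illustration of that effect (all the numbers here are made up, except the 200h retention from the question's config):

```python
# Rough, illustrative arithmetic: every distinct (metric, label-set) pair
# is its own time series, so per-instance labels multiply the series count.
metrics_per_service = 50    # hypothetical metric count one service exposes
live_replicas = 10          # replicas running at any given moment
replaced_per_day = 20       # replicas churned per day, each with a fresh label
retention_days = 200 / 24   # the question's --storage.tsdb.retention.time=200h

# Without a per-instance label the series count stays flat:
series_static = metrics_per_service

# With a unique label per replica, every replica seen during the retention
# window keeps its series alive in the TSDB:
series_per_instance = metrics_per_service * (
    live_replicas + replaced_per_day * retention_days
)

print(series_static)                 # 50
print(round(series_per_instance))    # 8833 — nearly 180x more series
```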
Moreover, if you are not actually interested in grouping by that extra label, then to me this is not the right solution to the problem but a workaround that may cause hard-to-debug issues down the road. The solution that has worked well for me is to use a bridge, such as StatsD, which receives the raw signals and acts as a central aggregation point and scrape target for Prometheus. If you are interested, I wrote about this in the context of multi-threaded web servers:
https://mkaz.me/blog/2023/collecting-metrics-from-multi-process-web-servers-the-ruby-case/
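To make the bridge idea concrete: each replica fires raw events at a single StatsD(-exporter) address over UDP, and Prometheus scrapes only that one aggregator. A minimal sketch (the host, port, and metric name are made-up assumptions):

```python
import socket

def statsd_increment(name: str, value: int = 1,
                     host: str = "statsd-exporter", port: int = 9125) -> bytes:
    """Send a StatsD counter increment over UDP and return the wire payload.

    The StatsD line protocol for a counter is "<name>:<value>|c".
    """
    payload = f"{name}:{value}|c".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))  # fire-and-forget; UDP never blocks
    except OSError:
        pass  # no aggregator reachable — fine for this sketch
    finally:
        sock.close()
    return payload

# Every replica calls this; aggregation happens in one place, so Prometheus
# sees a single target and no per-instance label explosion.
print(statsd_increment("http_requests_total"))  # b'http_requests_total:1|c'
```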