应用监控

如果我们要在我们的生产环境真正使用 Prometheus，往往需要关注各种各样的指标，譬如服务器的 CPU 负载、内存占用量、IO 开销、入网和出网流量等等。正如上面所说，Prometheus 是使用 Pull 的方式来获取指标数据的，要让 Prometheus 从目标处获得数据，首先必须在目标上安装指标收集的程序，并暴露出 HTTP 接口供 Prometheus 查询，这个指标收集程序被称为 Exporter，不同的指标需要不同的 Exporter 来收集，目前已经有大量的 Exporter 可供使用，几乎囊括了我们常用的各种系统和软件，官网列出了一份常用 Exporter 的清单，各个 Exporter 都遵循一份端口约定，避免端口冲突，即从 9100 开始依次递增，这里是完整的 Exporter 端口列表。另外值得注意的是，有些软件和系统无需安装 Exporter，这是因为他们本身就提供了暴露 Prometheus 格式的指标数据的功能，比如 Kubernetes、Grafana、Etcd、Ceph 等。

主机节点监控

首先我们来收集服务器的指标，这需要安装 node_exporter，这个 exporter 用于收集 *NIX 内核的系统，如果你的服务器是 Windows，可以使用 WMI exporter。

和 Prometheus server 一样，node_exporter 也是开箱即用的；node_exporter 启动之后，我们访问下 /metrics 接口看看是否能正常获取服务器指标：

$ curl http://localhost:9100/metrics

然后可以在 Prometheus 中添加抓取配置：

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.0.107:9090"]
  - job_name: "server"
    static_configs:
      - targets: ["192.168.0.107:9100"]

一般情况下，node_exporter 都是直接运行在要收集指标的服务器上的，官方不推荐用 Docker 来运行 node_exporter。如果逼不得已一定要运行在 Docker 里，要特别注意，这是因为 Docker 的文件系统和网络都有自己的 namespace，收集的数据并不是宿主机真实的指标。可以使用一些变通的方法，比如运行 Docker 时加上下面这样的参数：

$ docker run -d \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter \
  --path.rootfs /host

Python 应用监控

这里我们以简单的 Python 应用为例，介绍如何在应用层面接入到 Prometheus 监控中：

import http.server
from prometheus_client import start_http_server

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()

start_http_server(8000) 会启动一个监听 8000 的 HTTP 服务器，我们访问 http://localhost:8000/ 即可以获取到当前的 Metrics：

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 372.0
python_gc_objects_collected_total{generation="1"} 0.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 34.0
python_gc_collections_total{generation="1"} 3.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="0",version="3.8.0"} 1.0

然后我们在 Prometheus 配置文件中添加如下的抓取配置：

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: example
    static_configs:
      - targets:
          - localhost:8000

在 Prometheus 中输入 python_info PromQL 表达式，即可以看到如下的结果：

Evaluating the expression python_info produces one result

Counter

Counter 是您可能最常在仪器中使用的度量标准类型，Counter 跟踪事件的数量或大小，它们主要用于跟踪执行特定代码路径的频率。

from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total',
        'Hello Worlds requested.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

这里我们可以使用 rate(hello_worlds_total[1m]) 这样的表达式来查看接口请求的变化率：

其他的常用场景譬如统计错误率：

import random
from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total',
        'Hello Worlds requested.')
EXCEPTIONS = Counter('hello_world_exceptions_total',
        'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        with EXCEPTIONS.count_exceptions():
          if random.random() < 0.2:
            raise Exception
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

然后使用 rate(hello_world_exceptions_total[1m]) / rate(hello_worlds_total[1m]) 这样的表达式来计算错误率。在 Python 中，我们还可以使用 count_exceptions 注解：

EXCEPTIONS = Counter('hello_world_exceptions_total',
        'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @EXCEPTIONS.count_exceptions()
    def do_GET(self):
      # ...

Prometheus 使用 64 位浮点数作为值，因此您不仅限于将计数器递增 1。实际上，您可以将计数器增加任何非负数。这使您可以跟踪处理的记录数，提供的字节数或以欧元为单位的销售额：

import random
from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total',
        'Hello Worlds requested.')
SALES = Counter('hello_world_sales_euro_total',
        'Euros made serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        euros = random.random()
        SALES.inc(euros)
        self.send_response(200)
        self.end_headers()
        self.wfile.write("Hello World for {} euros.".format(euros).encode())

Gauge

Gauge 是某些当前状态的快照，对于计数器来说，增长速度是您所关心的，对于仪表而言，它是仪表的实际值。因此，值可以同时上升和下降。Gauges 包含了 inc, dec 以及 set 三个方法：

import time
from prometheus_client import Gauge

INPROGRESS = Gauge('hello_worlds_inprogress',
        'Number of Hello Worlds in progress.')
LAST = Gauge('hello_world_last_time_seconds',
        'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        INPROGRESS.inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LAST.set(time.time())
        INPROGRESS.dec()

这些指标可以直接在表达式浏览器中使用，而无需任何其他功能。例如，hello_world_last_time_seconds 可用于确定何时提供最后一个 Hello World。这种度量标准的主要用例是检测自处理请求以来是否时间过长。PromQL 表达式 time() -hello_world_last_time_seconds 将告诉您自上次请求以来有多少秒。

from prometheus_client import Gauge

INPROGRESS = Gauge('hello_worlds_inprogress',
        'Number of Hello Worlds in progress.')
LAST = Gauge('hello_world_last_time_seconds',
        'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @INPROGRESS.track_inprogress()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LAST.set_to_current_time()

Summary

当您试图了解系统性能时，了解您的应用程序响应请求所花的时间或后端的延迟是至关重要的指标。其他仪器系统提供某种形式的 Timer 度量标准，但是 Prometheus 更一般地看待事物。正如计数器可以增加一个非 1 的值一样，您可能希望跟踪有关事件的信息，而不是其延迟。例如，除了后端延迟之外，您可能还希望跟踪您返回的响应的大小。最常用的方法就是 observe：

import time
from prometheus_client import Summary

LATENCY = Summary('hello_world_latency_seconds',
        'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        start = time.time()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LATENCY.observe(time.time() - start)

如果查看 /metrics，则会看到 hello_world_latency_seconds 度量标准具有两个时间序列：hello_world_latency_seconds_count 和 hello_world_latency_seconds_sum。hello_world_latency_seconds_count 是已进行的观察调用的数量，因此表达式浏览器中的 rate(hello_world_latency_seconds_count[1m]) 将返回 Hello World 请求的每秒速率。hello_world_latency_seconds_sum 是传递给观察值的总和，因此 rate(hello_world_latency_seconds_sum[1m]) 是每秒响应请求所花费的时间。如果将这两个表达式相除，您将获得最后一分钟的平均延迟。平均延迟的完整表达式为：

rate(hello_world_latency_seconds_sum[1m]) / rate(hello_world_latency_seconds_count[1m])

Histogram

Summary 将提供平均延迟，但是如果要分位数呢？分位数告诉您，一定比例的事件的大小小于给定值。例如，0.95 分位数为 300 毫秒，这意味着 95％的请求花费的时间少于 300 毫秒。在推理实际的最终用户体验时，分位数很有用。如果用户的浏览器向您的应用程序发出 20 个并发请求，则确定用户可见延迟的时间是最慢的。在这种情况下，第 95 个百分位数捕获了该延迟。

from prometheus_client import Histogram

LATENCY = Histogram('hello_world_latency_seconds',
        'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @LATENCY.time()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

这将产生一组名称为 hello_world_latency_seconds_bucket 的时间序列，这是一组计数器。直方图具有一组存储桶，例如 1 ms，10 ms 和 25 ms，用于跟踪落入每个存储桶的事件数。histogram_quantile PromQL 函数可以从存储桶中计算分位数。例如，0.95 分位数（第 95 个百分位数）为：

histogram_quantile(0.95, rate(hello_world_latency_seconds_bucket[1m]))

Buckets

默认存储桶的延迟范围为 1ms 到 10s，这旨在捕获 Web 应用程序的典型延迟范围。但是，您也可以在定义指标时覆盖它们并提供自己的存储桶。如果默认值不适合您的用例，或者为服务级别协议（SLA）中提到的等待时间分位数添加显式存储桶，则可以执行此操作。为了帮助您检测 typos，必须对提供的存储桶进行排序：

LATENCY = Histogram('hello_world_latency_seconds',
        'Time for a request Hello World.',
        buckets=[0.0001, 0.0002, 0.0005, 0.001, 0.01, 0.1])

如果您想要线性或指数存储桶，则可以使用 Python 列表推导。与列表解析不等效的语言的客户端库可能包括以下工具功能：

buckets=[0.1 * x for x in range(1, 10)]    # Linear
buckets=[0.1 * 2**x for x in range(1, 10)] # Exponential

最近更新于 0001-01-01