Prometheus Alerting: A Walkthrough Using a CPU Usage Alert

This article follows one complete alert from configuration through to the notification that is finally delivered, and examines each step along the way. The overall flow was originally drawn as a sequence diagram, summarized here.

Participants: Node Exporter (port 9100), Prometheus Server (port 9090), Alert Rules Engine (an internal Prometheus component), Alertmanager (port 9093), Notification Channel (Webhook/Email/Slack). The diagram uses the default Prometheus port 9090; the environment built later in this article listens on 9800 instead.

Stage 1: Data collection (Scraping)
- Node Exporter exposes metrics at http://node-ip:9100/metrics (including node_cpu_seconds_total).
- Prometheus Server sends GET /metrics every 15s, and Node Exporter returns the raw text data.
- Prometheus Server stores the scraped samples in its TSDB.

Stage 2: Rule evaluation (Evaluation)
- The rules engine evaluates the alert rules periodically (for example every 30s).
- Rule example: IF avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.9 THEN CPU > 10%.

Stage 3: Alert triggering and sending (Firing)
- Condition met (CPU > 10%): the state changes from Pending to Firing, and Prometheus sends POST /api/v2/alerts with JSON alert objects. Each object contains labels (instance name, severity), annotations (detailed description), and a startsAt timestamp.
- Condition not met: the state stays Inactive or Resolved, and a resolved status is sent if the alert had fired before.

Stage 4: Alert processing and notification (Routing & Notify)
- Alertmanager applies: 1. deduplication, 2. grouping, 3. inhibition, 4. silencing.
- Alertmanager calls the configured receiver (for example: POST http://webhook.site/...), which acknowledges with 200 OK.
- The flow ends and waits for the next evaluation cycle.

Configuration

The alerting configuration in Prometheus has two main parts. The first manages alert labels: replacing, renaming, and filtering them. The second lists the Alertmanagers that alerts can be sent to.

```go
// AlertingConfig configures alerting and alertmanager related configs.
type AlertingConfig struct {
	// Relabel configs applied to alert labels.
	AlertRelabelConfigs []*relabel.Config `yaml:"alert_relabel_configs,omitempty"`
	// Alertmanager configs.
	AlertmanagerConfigs AlertmanagerConfigs `yaml:"alertmanagers,omitempty"`
}
```

relabel.Config

```go
// Config is the configuration for relabeling of target label sets.
type Config struct {
	// A list of labels from which values are taken and concatenated
	// with the configured separator in order.
	SourceLabels model.LabelNames `yaml:"source_labels,flow,omitempty" json:"sourceLabels,omitempty"`
	// Separator is the string between concatenated values from the source labels.
	// It is the delimiter used when the label values are joined.
	Separator string `yaml:"separator,omitempty" json:"separator,omitempty"`
	// Regex against which the concatenation is matched.
	// Default is '(.*)'.
	Regex Regexp `yaml:"regex,omitempty" json:"regex,omitempty"`
	// Modulus to take of the hash of concatenated values from the source labels.
	Modulus uint64 `yaml:"modulus,omitempty" json:"modulus,omitempty"`
	// TargetLabel is the label to which the resulting string is written in a replacement.
	// Regexp interpolation is allowed for the replace action.
	TargetLabel string `yaml:"target_label,omitempty" json:"targetLabel,omitempty"`
	// Replacement is the regex replacement pattern to be used.
	// On a successful match the matched content is substituted; the default is $1.
	Replacement string `yaml:"replacement,omitempty" json:"replacement,omitempty"`
	// Action is the action to be performed for the relabeling.
	Action Action `yaml:"action,omitempty" json:"action,omitempty"`
}
```

AlertmanagerConfigs

```go
// AlertmanagerConfigs is a slice of *AlertmanagerConfig.
type AlertmanagerConfigs []*AlertmanagerConfig

// AlertmanagerConfig configures how Alertmanagers can be discovered and communicated with.
type AlertmanagerConfig struct {
	// We cannot do proper Go type embedding below as the parser will then parse
	// values arbitrarily into the overflow maps of further-down types.

	ServiceDiscoveryConfigs discovery.Configs       `yaml:"-"`
	HTTPClientConfig        config.HTTPClientConfig `yaml:",inline"`
	SigV4Config             *sigv4.SigV4Config      `yaml:"sigv4,omitempty"`

	// The URL scheme to use when talking to Alertmanagers.
	Scheme string `yaml:"scheme,omitempty"`
	// Path prefix to add in front of the push endpoint path.
	PathPrefix string `yaml:"path_prefix,omitempty"`
	// The timeout used when sending alerts.
	Timeout model.Duration `yaml:"timeout,omitempty"`
	// The api version of Alertmanager.
	APIVersion AlertmanagerAPIVersion `yaml:"api_version"`
	// List of Alertmanager relabel configurations.
	RelabelConfigs []*relabel.Config `yaml:"relabel_configs,omitempty"`
	// Relabel alerts before sending to the specific alertmanager.
	AlertRelabelConfigs []*relabel.Config `yaml:"alert_relabel_configs,omitempty"`
}
```

Configuration example

```yaml
alerting:
  alertmanagers:
    # Static Alertmanager configuration
    - static_configs:
        # Fill in your own Alertmanager address. This example uses an
        # Alertmanager started on the local machine, listening on port 9093.
        - targets:
            - localhost:9093
  # Optional: alert relabeling.
  # Some labels may need to be filtered, hidden, or rewritten before an alert
  # is sent; alert_relabel_configs is the place to do that.
  alert_relabel_configs:
    # source_labels names the labels to process
    - source_labels: [instance]
      regex: '([^:]+):\d+'
      target_label: hostname
      replacement: '$1'
```

To see how the alertmanagers field is parsed, read readConfigs in the discovery package:

```go
func readConfigs(structVal reflect.Value, startField int) (Configs, error) {
	var (
		configs Configs
		targets []*targetgroup.Group
	)
	for i, n := startField, structVal.NumField(); i < n; i++ {
		field := structVal.Field(i)
		if field.Kind() != reflect.Slice {
			panic("discovery: internal error: field is not a slice")
		}
		for k := 0; k < field.Len(); k++ {
			val := field.Index(k)
			if val.IsZero() || (val.Kind() == reflect.Ptr && val.Elem().IsZero()) {
				key := configFieldNames[field.Type().Elem()]
				key = strings.TrimPrefix(key, configFieldPrefix)
				return nil, fmt.Errorf("empty or null section in %s", key)
			}
			switch c := val.Interface().(type) {
			case *targetgroup.Group:
				// Add index to the static config target groups for unique identification
				// within scrape pool.
				c.Source = strconv.Itoa(len(targets))
				// Coalesce multiple static configs into a single static config.
				targets = append(targets, c)
			case Config:
				configs = append(configs, c)
			default:
				panic("discovery: internal error: slice element is not a Config")
			}
		}
	}
	if len(targets) > 0 {
		configs = append(configs, StaticConfig(targets))
	}
	return configs, nil
}
```

The alert_relabel_configs section in detail:

```yaml
alert_relabel_configs:
  # source_labels names the labels whose values are processed
  - source_labels: [instance]
    # Regular expression matched against the (joined) label value
    regex: '([^:]+):\d+'
    # Name of the label that receives the result
    target_label: hostname
    # Value written to target_label: the first capture group, i.e. whatever
    # ([^:]+) matched. For the value localhost:9100 this is localhost.
    replacement: '$1'
    # action is not specified, so it defaults to replace
```

When no action is given, the default is replace. The replace action writes the replacement value into target_label. Because target_label here (hostname) differs from the source label (instance), the original label is kept and a new label is added next to it. To remove a label instead, set action to labeldrop.

For example, the full label set of the node_cpu_seconds_total metric looks like this:

```
node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node_export",mode="idle"}
```

After the relabel step above, the same label set gains a hostname label:

```
node_cpu_seconds_total{cpu="0",instance="localhost:9100",hostname="localhost",job="node_export",mode="idle"}
```

In the alert actually sent to Alertmanager (see the capture at the end), the rule's avg by(instance) aggregation has already dropped cpu, job, and mode, so only instance and the new hostname remain alongside the labels defined in the rule.

Environment setup

We configure Prometheus to raise an alert when the system CPU usage exceeds 10%, which requires adding an alert rules file. The environment is built with docker-compose, so the local rules file must first be mounted into the Prometheus container through docker-compose.yml:

```yaml
services:
  prometheus-compose:
    image: prom/prometheus
    container_name: prometheus-compose
    volumes:
      - ./yaml/prometheus/:/prometheus/
      # Local alert rules are kept in ./yaml/prometheus/rules/
      - ./yaml/prometheus/rules/:/prometheus/rules/
      - ./data:/prometheus/data
    command:
      - --web.listen-address=:9800
    network_mode: host
    pid: host
    depends_on:
      - node_exporter
```

Create alert_rules.yml and add the alert rule:

```yaml
groups:
  - name: host_alerts
    rules:
      # CPU usage alert
      - alert: HighCPUUsage
        expr: |
          100 - ( avg by(instance) ( rate(node_cpu_seconds_total{mode="idle"}[5m]) ) * 100 ) > 10
        for: 30s  # the condition must hold for 30s before the alert fires
        labels:
          severity: warning
          team: infrastructure
          alert_type: resource
        annotations:
          # summary: "High CPU usage (instance ...)"
          summary: "高CPU使用率 (实例 {{ $labels.instance }})"
          # description: "CPU usage of <instance> exceeds 10%. / Current value / Threshold"
          description: |
            {{ $labels.instance }} 的CPU使用率超过10%。
            当前值: {{ $value | printf "%.2f" }}%
            阈值: 10%
          dashboard: http://localhost:3000/d/node-exporter-full
          runbook: https://example.com/runbooks/high-cpu-usage
```

Point Prometheus at the rules file with the rule_files field, a top-level key next to global:

```yaml
global:
rule_files:
  - /prometheus/rules/alert_rules.yml
```

Create the alertmanager.yml configuration:

```yaml
global:
  # Global settings
  smtp_smarthost: 'smtp.gmail.com:587'  # only needed for email notifications
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

# Routing
route:
  group_by: ['alertname', 'severity', 'instance']
  group_wait: 30s      # how long to wait while collecting alerts of the same group
  group_interval: 5m   # interval between notifications for the same group
  repeat_interval: 4h  # interval before an unchanged alert is sent again
  receiver: 'default-receiver'
  # Child routes: send alerts to different receivers based on labels
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 10s
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: 'warning-receiver'
      group_wait: 1m
      repeat_interval: 2h

# Inhibition rules (reduce redundant alerts)
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']

# Receivers
receivers:
  # Default receiver (webhook example)
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://webhook.site/your-unique-url'  # for testing
        send_resolved: true
    # Email notification (optional)
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
        headers:
          subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
  # Receiver for critical alerts
  - name: 'critical-receiver'
    webhook_configs:
      - url: 'http://webhook.site/critical-alerts'
        send_resolved: true
    email_configs:
      - to: 'oncall-team@example.com'
        send_resolved: true
  # Receiver for warning alerts
  - name: 'warning-receiver'
    webhook_configs:
      - url: 'http://webhook.site/warning-alerts'
        send_resolved: true
```

Use chaosblade to create CPU load so that usage is guaranteed to exceed 10%:

```shell
./blade create cpu load --cpu-percent 30
```

With that in place, Prometheus can be started. Because the environment is defined with docker-compose, starting it is simple: run docker-compose up -d in the directory that contains docker-compose.yml.

Analyzing the alert data

When the alert fires, the Alertmanager UI shows it with the labels below. Let us go through where each one comes from.

- alertname="HighCPUUsage": comes from the groups.rules.alert field in alert_rules.yml and identifies which rule produced the alert.
- instance="localhost:9100": a label carried by the node_cpu_seconds_total metric itself.
- severity="warning": comes from groups.rules.labels.severity in alert_rules.yml. A rule can define new labels, and both these labels and the labels that survive the rule expression can be used by Alertmanager to route and group alerts. Here alerts are grouped by the three fields in group_by: ['alertname', 'severity', 'instance']. Once alerts are assigned to a group, shared handling settings can be applied to that group as a whole.
- alert_type="resource": comes from the labels defined in the alert rule.
- hostname="localhost": comes from the Prometheus alert relabeling (alert_relabel_configs).
- team="infrastructure": comes from groups.rules.labels.team in alert_rules.yml.

Capturing the traffic shows the alert payload that Prometheus posts. The Chinese text was not decoded correctly in the capture, so the runs of dots stand for the Chinese parts of the annotations:

```json
[
  {
    "annotations": {
      "dashboard": "http://localhost:3000/d/node-exporter-full",
      "description": "localhost:9100 ...CPU...............10%...\n.........: 30.24%\n......: 10%\n",
      "runbook": "https://example.com/runbooks/high-cpu-usage",
      "summary": "...CPU......... (...... localhost:9100)"
    },
    "endsAt": "2026-03-07T08:35:58.194Z",
    "startsAt": "2026-03-07T08:31:58.194Z",
    "generatorURL": "http://andrew:9800/graph?g0.expr=100-%28avg+by%28instance%29%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29%29%2A100%29%3E10&g0.tab=1",
    "labels": {
      "alert_type": "resource",
      "alertname": "HighCPUUsage",
      "hostname": "localhost",
      "instance": "localhost:9100",
      "severity": "warning",
      "team": "infrastructure"
    }
  }
]
```
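
To make the alert relabeling step more concrete, here is a minimal sketch of what the replace action does to an alert's labels. It uses only the Go standard library rather than the Prometheus relabel package, and the function name applyReplace is hypothetical. It assumes the regex is anchored at both ends, as Prometheus does, and it leaves out details of the real implementation such as deleting the target label when the result is empty.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// applyReplace is a hypothetical helper that mimics the "replace" relabel
// action: join the source label values, match them against the anchored
// regex, and write the expanded replacement into targetLabel.
// It returns a copy, so the input label set is never modified.
func applyReplace(labels map[string]string, sourceLabels []string,
	separator, pattern, targetLabel, replacement string) (map[string]string, error) {

	// Anchor the pattern so that it must match the whole joined value.
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return nil, err
	}

	// Concatenate the source label values with the separator.
	values := make([]string, 0, len(sourceLabels))
	for _, name := range sourceLabels {
		values = append(values, labels[name])
	}
	joined := strings.Join(values, separator)

	// Copy the original labels; existing labels are kept as they are.
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}

	// No match means the label set is left unchanged.
	idx := re.FindStringSubmatchIndex(joined)
	if idx == nil {
		return out, nil
	}

	// Expand $1, $2, ... in the replacement and store it under targetLabel.
	out[targetLabel] = string(re.ExpandString(nil, replacement, joined, idx))
	return out, nil
}

func main() {
	labels := map[string]string{
		"alertname": "HighCPUUsage",
		"instance":  "localhost:9100",
		"severity":  "warning",
	}
	out, err := applyReplace(labels, []string{"instance"}, ";",
		`([^:]+):\d+`, "hostname", "$1")
	if err != nil {
		panic(err)
	}
	fmt.Println(out["instance"], out["hostname"]) // prints "localhost:9100 localhost"
}
```

The instance label is still present in the result, which matches the captured payload: replace adds hostname next to instance instead of overwriting it.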
