How to Configure Custom Alerting Rules with Prometheus

May 25, 2020


Introduction


Prometheus is an open-source system for monitoring and alerting. Originally developed at SoundCloud, it moved to the CNCF in 2016 and became one of the most popular projects after Kubernetes itself. It can monitor anything from an entire Linux server to a stand-alone web server, a database service, or a single process. In Prometheus terminology, the things it monitors are called targets. Each unit of a target is called a metric. Prometheus scrapes targets over HTTP at a configured interval, collects metrics, and stores the data in its time-series database. You can query a target's metrics with the PromQL query language.
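As a quick taste of PromQL, the query below computes a per-second rate from a counter metric; the metric and job names are hypothetical and stand in for whatever your own targets expose:

# Per-second rate of HTTP requests over the last 5 minutes,
# for a hypothetical target labeled job="my-app"
rate(http_requests_total{job="my-app"}[5m])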


In this article, we will show, step by step, how to:


  • Install Prometheus (using the prometheus-operator Helm chart) to monitor/alert based on custom events

  • Create and configure custom alerting rules, which fire alerts when their conditions are met

  • Integrate Alertmanager to handle the alerts sent by client applications (the Prometheus server, in this case)

  • Integrate Alertmanager with an email account that alert notifications will be sent to.


Understanding Prometheus and Its Abstractions


The diagram below shows all the components that make up the Prometheus ecosystem:



All the components of the Prometheus ecosystem




Here is a quick overview of the terms relevant to this article:


  • Prometheus Server: the main component, which scrapes metrics and stores them in the time-series database

  • Scraping: a pull method for retrieving metrics; scrapes typically happen at an interval of 10-60 seconds (see the minimal scrape configuration sketched after this list)

  • Target: a server client that data is retrieved from

  • Service discovery: enables Prometheus to identify the applications it needs to monitor and to pull metrics from them in dynamic environments

  • Alertmanager: the component responsible for handling alerts (including silencing, inhibition, aggregating alerts, and sending notifications via email, PagerDuty, Slack, and so on)

  • Data visualization: the scraped data is stored in local storage and can be queried directly with PromQL or viewed through Grafana dashboards
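To make "scraping" and "target" concrete, here is a minimal hand-written prometheus.yml sketch; the job name and target address are hypothetical, and the Helm chart we install below generates this kind of configuration for us:

global:
  scrape_interval: 30s                       # how often targets are scraped (typically 10-60s)
scrape_configs:
- job_name: my-app                           # hypothetical scrape job
  static_configs:
  - targets: ['my-app.default.svc:8080']     # hypothetical target exposing /metrics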


Understanding the Prometheus Operator


According to CoreOS, the owner of the Prometheus Operator project, the Operator makes the Prometheus configuration Kubernetes-native and manages and operates Prometheus and Alertmanager clusters.


The Operator introduces the following Kubernetes custom resource definitions (CRDs): Prometheus, ServiceMonitor, PrometheusRule, and Alertmanager. If you want to learn more, see:


https://github.com/coreos/prometheus-operator/blob/master/Documentation/design.md


In our demo, we will use PrometheusRule to define a custom rule.


First, we need to install the Prometheus Operator using the stable/prometheus-operator Helm chart, available at:


https://github.com/helm/charts/tree/master/stable/prometheus-operator


The default installation deploys the following components: prometheus-operator, prometheus, alertmanager, node-exporter, kube-state-metrics, and grafana. By default, Prometheus scrapes the main Kubernetes components: kube-apiserver, kube-controller-manager, and etcd.


Installing the Prometheus Software


Prerequisites


To follow this demo, you will need:


  • A Google Cloud Platform account (the free tier is enough); any other cloud will also work

  • Rancher v2.3.5 (the latest version at the time of writing)

  • A Kubernetes cluster running on GKE (version 1.15.9-gke.12); EKS or AKS will also work

  • The Helm binary installed on your machine


Starting a Rancher Instance


Simply follow this intuitive getting-started guide:


https://rancher.com/quick-start


Deploying a GKE Cluster with Rancher


Use Rancher to set up and configure your Kubernetes cluster:


https://rancher.com/docs/rancher/v2.x/en/cluster-provisioning/hosted-kubernetes-clusters/gke/


Once the cluster is deployed and the kubeconfig file is configured with the appropriate credentials and endpoint information, you can point kubectl at that particular cluster.
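For example, assuming you saved the kubeconfig exported from Rancher to a local file (the path below is hypothetical), a quick sanity check would look like this:

$ export KUBECONFIG=~/.kube/gke-demo-cluster.yaml
$ kubectl get nodes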


Deploying the Prometheus Software


First, let's check which Helm version we are running:


$ helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}


Since we are using Helm 3, we need to add the stable chart repository, as it is not set up by default:


$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
"stable" has been added to your repositories


$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈


$ helm repo list
NAME    URL
stable  https://kubernetes-charts.storage.googleapis.com


With Helm configured, we can proceed to install prometheus-operator:


$ kubectl create namespace monitoring
namespace/monitoring created


$ helm install --namespace monitoring demo stable/prometheus-operator
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
NAME: demo
LAST DEPLOYED: Sat Mar 14 09:40:35 2020
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
The Prometheus Operator has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=demo"

Visit https://github.com/coreos/prometheus-operator for instructions on how
to create & configure Alertmanager and Prometheus instances using the Operator.


Rules


Besides monitoring, Prometheus lets us create rules that trigger alerts. These rules are based on Prometheus's expression language. Whenever a rule's condition is met, an alert fires and is sent to Alertmanager. We will see what these rules look like in a moment.
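As a preview, a minimal alerting rule in Prometheus's native rules-file format looks roughly like the sketch below; the alert name and threshold are made up, and our demo will define its rule through the PrometheusRule CRD instead:

groups:
- name: example.rules
  rules:
  - alert: InstanceDown
    expr: up == 0              # PromQL condition to evaluate
    for: 5m                    # condition must hold for 5 minutes before firing
    labels:
      severity: critical
    annotations:
      message: '{{ $labels.instance }} has been down for more than 5 minutes'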


Back to our demo: once Helm has finished deploying, we can check which pods have been created:


$ kubectl -n monitoring get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
alertmanager-demo-prometheus-operator-alertmanager-0   2/2     Running   0          61s
demo-grafana-5576fbf669-9l57b                          3/3     Running   0          72s
demo-kube-state-metrics-67bf64b7f4-4786k               1/1     Running   0          72s
demo-prometheus-node-exporter-ll8zx                    1/1     Running   0          72s
demo-prometheus-node-exporter-nqnr6                    1/1     Running   0          72s
demo-prometheus-node-exporter-sdndf                    1/1     Running   0          72s
demo-prometheus-operator-operator-b9c9b5457-db9dj      2/2     Running   0          72s
prometheus-demo-prometheus-operator-prometheus-0       3/3     Running   1          50s


To access Prometheus and Alertmanager from a web browser, we need to use port forwarding.


Since this demo uses a GCP instance and all kubectl commands are run from that instance, we use the instance's external IP address to access the resources.


$ kubectl port-forward --address 0.0.0.0 -n monitoring prometheus-demo-prometheus-operator-prometheus-0 9090 >/dev/null 2>&1 &


$ kubectl port-forward --address 0.0.0.0 -n monitoring alertmanager-demo-prometheus-operator-alertmanager-0 9093 >/dev/null 2>&1 &



The "Alerts" tab shows all the currently configured/active alerts. They can also be checked from the CLI by querying the CRD named prometheusrules:



$ kubectl -n monitoring get prometheusrules
NAME                                                            AGE
demo-prometheus-operator-alertmanager.rules                     3m21s
demo-prometheus-operator-etcd                                   3m21s
demo-prometheus-operator-general.rules                          3m21s
demo-prometheus-operator-k8s.rules                              3m21s
demo-prometheus-operator-kube-apiserver-error                   3m21s
demo-prometheus-operator-kube-apiserver.rules                   3m21s
demo-prometheus-operator-kube-prometheus-node-recording.rules   3m21s
demo-prometheus-operator-kube-scheduler.rules                   3m21s
demo-prometheus-operator-kubernetes-absent                      3m21s
demo-prometheus-operator-kubernetes-apps                        3m21s
demo-prometheus-operator-kubernetes-resources                   3m21s
demo-prometheus-operator-kubernetes-storage                     3m21s
demo-prometheus-operator-kubernetes-system                      3m21s
demo-prometheus-operator-kubernetes-system-apiserver            3m21s
demo-prometheus-operator-kubernetes-system-controller-manager   3m21s
demo-prometheus-operator-kubernetes-system-kubelet              3m21s
demo-prometheus-operator-kubernetes-system-scheduler            3m21s
demo-prometheus-operator-node-exporter                          3m21s
demo-prometheus-operator-node-exporter.rules                    3m21s
demo-prometheus-operator-node-network                           3m21s
demo-prometheus-operator-node-time                              3m21s
demo-prometheus-operator-node.rules                             3m21s
demo-prometheus-operator-prometheus                             3m21s
demo-prometheus-operator-prometheus-operator                    3m21s


We can also inspect the physical rule files, located in the prometheus container of the Prometheus pod:


$ kubectl -n monitoring exec -it prometheus-demo-prometheus-operator-prometheus-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-demo-prometheus-operator-prometheus-0 -n monitoring' to see all of the containers in this pod.


Inside the container, we can check the path where the rules are stored:


/prometheus $ ls /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0/
monitoring-demo-prometheus-operator-alertmanager.rules.yaml
monitoring-demo-prometheus-operator-etcd.yaml
monitoring-demo-prometheus-operator-general.rules.yaml
monitoring-demo-prometheus-operator-k8s.rules.yaml
monitoring-demo-prometheus-operator-kube-apiserver-error.yaml
monitoring-demo-prometheus-operator-kube-apiserver.rules.yaml
monitoring-demo-prometheus-operator-kube-prometheus-node-recording.rules.yaml
monitoring-demo-prometheus-operator-kube-scheduler.rules.yaml
monitoring-demo-prometheus-operator-kubernetes-absent.yaml
monitoring-demo-prometheus-operator-kubernetes-apps.yaml
monitoring-demo-prometheus-operator-kubernetes-resources.yaml
monitoring-demo-prometheus-operator-kubernetes-storage.yaml
monitoring-demo-prometheus-operator-kubernetes-system-apiserver.yaml
monitoring-demo-prometheus-operator-kubernetes-system-controller-manager.yaml
monitoring-demo-prometheus-operator-kubernetes-system-kubelet.yaml
monitoring-demo-prometheus-operator-kubernetes-system-scheduler.yaml
monitoring-demo-prometheus-operator-kubernetes-system.yaml
monitoring-demo-prometheus-operator-node-exporter.rules.yaml
monitoring-demo-prometheus-operator-node-exporter.yaml
monitoring-demo-prometheus-operator-node-network.yaml
monitoring-demo-prometheus-operator-node-time.yaml
monitoring-demo-prometheus-operator-node.rules.yaml
monitoring-demo-prometheus-operator-prometheus-operator.yaml
monitoring-demo-prometheus-operator-prometheus.yaml


To see in detail how these rules are loaded into Prometheus, check the pod's details. We can see that the configuration file used by the prometheus container is /etc/prometheus/config_out/prometheus.env.yaml. This configuration file tells Prometheus where the rule files live and how often to re-check them:


$ kubectl -n monitoring describe pod prometheus-demo-prometheus-operator-prometheus-0


The full output of the command is shown below:


Name:           prometheus-demo-prometheus-operator-prometheus-0
Namespace:      monitoring
Priority:       0
Node:           gke-c-7dkls-default-0-c6ca178a-gmcq/10.132.0.15
Start Time:     Wed, 11 Mar 2020 18:06:47 +0000
Labels:         app=prometheus
                controller-revision-hash=prometheus-demo-prometheus-operator-prometheus-5ccbbd8578
                prometheus=demo-prometheus-operator-prometheus
                statefulset.kubernetes.io/pod-name=prometheus-demo-prometheus-operator-prometheus-0
Annotations:    <none>
Status:         Running
IP:             10.40.0.7
IPs:            <none>
Controlled By:  StatefulSet/prometheus-demo-prometheus-operator-prometheus
Containers:
  prometheus:
    Container ID:  docker://360db8a9f1cce8d72edd81fcdf8c03fe75992e6c2c59198b89807aa0ce03454c
    Image:         quay.io/prometheus/prometheus:v2.15.2
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:914525123cf76a15a6aaeac069fcb445ce8fb125113d1bc5b15854bc1e8b6353
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=10d
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.external-url=http://demo-prometheus-operator-prometheus.monitoring:9090
      --web.route-prefix=/
    State:       Running
      Started:   Wed, 11 Mar 2020 18:07:07 +0000
    Last State:  Terminated
      Reason:    Error
      Message:   caller=main.go:648 msg="Starting TSDB ..."
level=info ts=2020-03-11T18:07:02.185Z caller=web.go:506 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-11T18:07:02.192Z caller=head.go:584 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-03-11T18:07:02.192Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:663 fs_type=EXT4_SUPER_MAGIC
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:664 msg="TSDB started"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:734 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:517 msg="Stopping scrape discovery manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:531 msg="Stopping notify discovery manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:553 msg="Stopping scrape manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=manager.go:814 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=manager.go:820 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:513 msg="Scrape discovery manager stopped"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:527 msg="Notify discovery manager stopped"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:547 msg="Scrape manager stopped"
level=info ts=2020-03-11T18:07:02.197Z caller=notifier.go:598 component=notifier msg="Stopping notification manager..."
level=info ts=2020-03-11T18:07:02.197Z caller=main.go:718 msg="Notifier manager stopped"
level=error ts=2020-03-11T18:07:02.197Z caller=main.go:727 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

      Exit Code:    1
      Started:      Wed, 11 Mar 2020 18:07:02 +0000
      Finished:     Wed, 11 Mar 2020 18:07:02 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:      http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0 from prometheus-demo-prometheus-operator-prometheus-rulefiles-0 (rw)
      /prometheus from prometheus-demo-prometheus-operator-prometheus-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
  prometheus-config-reloader:
    Container ID:  docker://de27cdad7067ebd5154c61b918401b2544299c161850daf3e317311d2d17af3d
    Image:         quay.io/coreos/prometheus-config-reloader:v0.37.0
    Image ID:      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:5e870e7a99d55a5ccf086063efd3263445a63732bc4c04b05cf8b664f4d0246e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --log-format=logfmt
      --reload-url=http://127.0.0.1:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    State:          Running
      Started:      Wed, 11 Mar 2020 18:07:04 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:
      POD_NAME:  prometheus-demo-prometheus-operator-prometheus-0 (v1:metadata.name)
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
  rules-configmap-reloader:
    Container ID:  docker://5804e45380ed1b5374a4c2c9ee4c9c4e365bee93b9ccd8b5a21f50886ea81a91
    Image:         quay.io/coreos/configmap-reload:v0.0.1
    Image ID:      docker-pullable://quay.io/coreos/configmap-reload@sha256:e2fd60ff0ae4500a75b80ebaa30e0e7deba9ad107833e8ca53f0047c42c5a057
    Port:          <none>
    Host Port:     <none>
    Args:
      --webhook-url=http://127.0.0.1:9090/-/reload
      --volume-dir=/etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0
    State:          Running
      Started:      Wed, 11 Mar 2020 18:07:06 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0 from prometheus-demo-prometheus-operator-prometheus-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-demo-prometheus-operator-prometheus
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-demo-prometheus-operator-prometheus-tls-assets
    Optional:    false
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-demo-prometheus-operator-prometheus-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-demo-prometheus-operator-prometheus-rulefiles-0
    Optional:  false
  prometheus-demo-prometheus-operator-prometheus-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  demo-prometheus-operator-prometheus-token-jvbrr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  demo-prometheus-operator-prometheus-token-jvbrr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age                    From                                          Message
  ----    ------     ----                   ----                                          -------
  Normal  Scheduled  4m51s                  default-scheduler                             Successfully assigned monitoring/prometheus-demo-prometheus-operator-prometheus-0 to gke-c-7dkls-default-0-c6ca178a-gmcq
  Normal  Pulling    4m45s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Pulling image "quay.io/prometheus/prometheus:v2.15.2"
  Normal  Pulled     4m39s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Successfully pulled image "quay.io/prometheus/prometheus:v2.15.2"
  Normal  Pulling    4m36s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Pulling image "quay.io/coreos/prometheus-config-reloader:v0.37.0"
  Normal  Pulled     4m35s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Successfully pulled image "quay.io/coreos/prometheus-config-reloader:v0.37.0"
  Normal  Pulling    4m34s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Pulling image "quay.io/coreos/configmap-reload:v0.0.1"
  Normal  Started    4m34s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Started container prometheus-config-reloader
  Normal  Created    4m34s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Created container prometheus-config-reloader
  Normal  Pulled     4m33s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Successfully pulled image "quay.io/coreos/configmap-reload:v0.0.1"
  Normal  Created    4m32s (x2 over 4m36s)  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Created container prometheus
  Normal  Created    4m32s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Created container rules-configmap-reloader
  Normal  Started    4m32s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Started container rules-configmap-reloader
  Normal  Pulled     4m32s                  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Container image "quay.io/prometheus/prometheus:v2.15.2" already present on machine
  Normal  Started    4m31s (x2 over 4m36s)  kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq  Started container prometheus
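In this output, the --config.file argument points the prometheus container at /etc/prometheus/config_out/prometheus.env.yaml, and the rules-configmap-reloader container watches the rulefiles volume for changes. The rule-related part of that generated file looks roughly like the sketch below; the exact contents depend on the chart version, so treat this as an approximation rather than the literal file:

global:
  evaluation_interval: 30s    # how often Prometheus re-evaluates the rules (chart-dependent)
rule_files:
- /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0/*.yaml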


Let's clean up the default rules so we can observe the one we are about to create more easily. The following command deletes all the rules but leaves demo-prometheus-operator-alertmanager.rules in place:


$ kubectl -n monitoring delete prometheusrules $(kubectl -n monitoring get prometheusrules | grep -v alert)


$ kubectl -n monitoring get prometheusrules
NAME                                          AGE
demo-prometheus-operator-alertmanager.rules   8m53s


Note: we kept just one rule only to make the demo easier to follow. There is one rule, however, that you should never delete: it lives in monitoring-demo-prometheus-operator-general.rules.yaml and is called Watchdog. This alert is always firing; its purpose is to ensure that the entire alerting pipeline is functional.
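For reference, the Watchdog rule in setups like this one is roughly the following; this is a sketch from memory, so check your own general.rules file for the exact definition:

- alert: Watchdog
  expr: vector(1)      # a constant expression, so the alert never stops firing
  labels:
    severity: none
  annotations:
    message: This is an alert meant to ensure that the entire alerting pipeline is functional.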


Let's examine the rule we kept from the CLI and compare it with what we will see in the browser:


$ kubectl -n monitoring describe prometheusrule demo-prometheus-operator-alertmanager.rules
Name:         demo-prometheus-operator-alertmanager.rules
Namespace:    monitoring
Labels:       app=prometheus-operator
              chart=prometheus-operator-8.12.1
              heritage=Tiller
              release=demo
Annotations:  prometheus-operator-validated: true
API Version:  monitoring.coreos.com/v1
Kind:         PrometheusRule
Metadata:
  Creation Timestamp:  2020-03-11T18:06:25Z
  Generation:          1
  Resource Version:    4871
  Self Link:           /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/demo-prometheus-operator-alertmanager.rules
  UID:                 6a84dbb0-feba-4f17-b3dc-4b6486818bc0
Spec:
  Groups:
    Name:  alertmanager.rules
    Rules:
      Alert:  AlertmanagerConfigInconsistent
      Annotations:
        Message:  The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync.
      Expr:     count_values("config_hash", alertmanager_config_hash{job="demo-prometheus-operator-alertmanager",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="demo-prometheus-operator-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "$1", "name", "(.*)") != 1
      For:      5m
      Labels:
        Severity:  critical
      Alert:       AlertmanagerFailedReload
      Annotations:
        Message:  Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}.
      Expr:     alertmanager_config_last_reload_successful{job="demo-prometheus-operator-alertmanager",namespace="monitoring"} == 0
      For:      10m
      Labels:
        Severity:  warning
      Alert:       AlertmanagerMembersInconsistent
      Annotations:
        Message:  Alertmanager has not found all other members of the cluster.
      Expr:     alertmanager_cluster_members{job="demo-prometheus-operator-alertmanager",namespace="monitoring"} != on (service) GROUP_LEFT() count by (service) (alertmanager_cluster_members{job="demo-prometheus-operator-alertmanager",namespace="monitoring"})
      For:      5m
      Labels:
        Severity:  critical
Events:  <none>



Let's remove all the default alerts from it and create one of our own:


$ kubectl -n monitoring edit prometheusrules demo-prometheus-operator-alertmanager.rules
prometheusrule.monitoring.coreos.com/demo-prometheus-operator-alertmanager.rules edited


Our custom alert looks like this:


$ kubectl -n monitoring describe prometheusrule demo-prometheus-operator-alertmanager.rules
Name:         demo-prometheus-operator-alertmanager.rules
Namespace:    monitoring
Labels:       app=prometheus-operator
              chart=prometheus-operator-8.12.1
              heritage=Tiller
              release=demo
Annotations:  prometheus-operator-validated: true
API Version:  monitoring.coreos.com/v1
Kind:         PrometheusRule
Metadata:
  Creation Timestamp:  2020-03-11T18:06:25Z
  Generation:          3
  Resource Version:    18180
  Self Link:           /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/demo-prometheus-operator-alertmanager.rules
  UID:                 6a84dbb0-feba-4f17-b3dc-4b6486818bc0
Spec:
  Groups:
    Name:  alertmanager.rules
    Rules:
      Alert:  PodHighCpuLoad
      Annotations:
        Message:  Alertmanager has found {{ $labels.instance }} with CPU too high
      Expr:     rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m]) > 0.04
      For:      1m
      Labels:
        Severity:  critical
Events:  <none>



These are the options we used in the alert we created (a full manifest combining these fields is sketched after the list):


  • annotations: a set of informational labels describing the alert

  • expr: the expression, written in PromQL

  • for: optional; when set, it tells Prometheus that the condition must stay active for the given duration. The alert only fires after that time has passed.

  • labels: extra labels that can be attached to the alert. If you want to learn more about alerting rules, see:

  • https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
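Putting these options together, the PrometheusRule manifest we edited above would look roughly like the following YAML. This is a sketch: in particular, the metadata labels must match the ruleSelector of your Prometheus custom resource, which this chart derives from the release name:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: demo-prometheus-operator-alertmanager.rules
  namespace: monitoring
  labels:
    app: prometheus-operator    # must match the Prometheus CR's ruleSelector
    release: demo
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: PodHighCpuLoad
      expr: rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m]) > 0.04
      for: 1m
      labels:
        severity: critical
      annotations:
        message: Alertmanager has found {{ $labels.instance }} with CPU too high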


Now that we have the Prometheus alert in place, let's configure Alertmanager so that it notifies us by email. Alertmanager's configuration lives in a Kubernetes Secret object.


$ kubectl get secrets -n monitoring
NAME                                                        TYPE                                  DATA   AGE
alertmanager-demo-prometheus-operator-alertmanager          Opaque                                1      32m
default-token-x4rgq                                         kubernetes.io/service-account-token   3      37m
demo-grafana                                                Opaque                                3      32m
demo-grafana-test-token-p6qnk                               kubernetes.io/service-account-token   3      32m
demo-grafana-token-ff6nl                                    kubernetes.io/service-account-token   3      32m
demo-kube-state-metrics-token-vmvbr                         kubernetes.io/service-account-token   3      32m
demo-prometheus-node-exporter-token-wlnk9                   kubernetes.io/service-account-token   3      32m
demo-prometheus-operator-admission                          Opaque                                3      32m
demo-prometheus-operator-alertmanager-token-rrx4k           kubernetes.io/service-account-token   3      32m
demo-prometheus-operator-operator-token-q9744               kubernetes.io/service-account-token   3      32m
demo-prometheus-operator-prometheus-token-jvbrr             kubernetes.io/service-account-token   3      32m
prometheus-demo-prometheus-operator-prometheus              Opaque                                1      31m
prometheus-demo-prometheus-operator-prometheus-tls-assets   Opaque                                0      31m


We are only interested in alertmanager-demo-prometheus-operator-alertmanager. Let's look at it:


$ kubectl -n monitoring get secret alertmanager-demo-prometheus-operator-alertmanager -o yaml
apiVersion: v1
data:
  alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==
kind: Secret
metadata:
  creationTimestamp: "2020-03-11T18:06:24Z"
  labels:
    app: prometheus-operator-alertmanager
    chart: prometheus-operator-8.12.1
    heritage: Tiller
    release: demo
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
  resourceVersion: "3018"
  selfLink: /api/v1/namespaces/monitoring/secrets/alertmanager-demo-prometheus-operator-alertmanager
  uid: 6baf6883-f690-47a1-bb49-491935956c22
type: Opaque


The alertmanager.yaml field is base64-encoded. Let's decode it:


$ echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==' | base64 --decode
global:
  resolve_timeout: 5m
receivers:
- name: "null"
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"


As we can see, this is the default Alertmanager configuration. You can also view it in the Status tab of the Alertmanager UI. Next, let's change it so that, in our case, it sends email:


$ cat alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: [Alertname]
  # Send all notifications to me.
  receiver: demo-alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - match:
      alertname: DemoAlertName
    receiver: 'demo-alert'
receivers:
- name: demo-alert
  email_configs:
  - to: your_email@gmail.com
    from: from_email@gmail.com
    # Your smtp server address
    smarthost: smtp.gmail.com:587
    auth_username: from_email@gmail.com
    auth_identity: from_email@gmail.com
    auth_password: 16letter_generated_token # you can use your gmail account password, but it is better to create a dedicated token for this
    headers:
      From: from_email@gmail.com
      Subject: 'Demo ALERT'


First, we need to encode this file:


$ cat alertmanager.yaml | base64 -w0


Once we have the encoded output, we paste it into the yaml file that we will apply:


$ cat alertmanager-secret-k8s.yaml
apiVersion: v1
data:
  alertmanager.yaml: <paste the encoded content of alertmanager.yaml here>
kind: Secret
metadata:
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
type: Opaque


$ kubectl apply -f alertmanager-secret-k8s.yaml
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
secret/alertmanager-demo-prometheus-operator-alertmanager configured


The configuration is reloaded automatically, and the change shows up in the UI.
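One way to confirm the reload from the CLI, assuming the port-forward to Alertmanager set up earlier is still running, is to query Alertmanager's HTTP API; its status endpoint includes the currently loaded configuration:

# Returns JSON; the config section should now contain the demo-alert receiver
$ curl -s http://localhost:9093/api/v2/status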



Next, let's deploy something to monitor. For this example, a simple nginx deployment is enough:


$ cat nginx-deployment.yaml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80


$ kubectl apply -f nginx-deployment.yaml
deployment.apps/nginx-deployment created


As specified in the yaml, we have 3 replicas:


$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-5754944d6c-7g6gq   1/1     Running   0          67s
nginx-deployment-5754944d6c-lhvx8   1/1     Running   0          67s
nginx-deployment-5754944d6c-whhtr   1/1     Running   0          67s


In the Prometheus UI, run the same expression we configured for the alert:


rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m])


We can check the data for these pods; the value should be 0 for all of them.



Let's add some load on one of the pods and watch how the value changes. When it goes above 0.04, we should receive an alert:


$ kubectl exec -it nginx-deployment-5754944d6c-7g6gq -- /bin/sh
# yes > /dev/null


An alert can be in one of three states:


  • Inactive: the alert condition is not met

  • Pending: the condition is met, but the alert has not yet been active for the duration set in the for field

  • Firing: the alert is firing


We have already seen the alert in the inactive state, so let's keep the load on the CPU to observe the remaining two states:




Once fired, the alert will show up in Alertmanager:



Alertmanager is configured to send an email whenever it receives an alert. So at this point, if we check the inbox, we will see something like the following:



Conclusion


We all know how important monitoring is, but it would be incomplete without alerting. When a problem occurs, an alert can notify us immediately, so we find out right away that something has gone wrong in the system. Prometheus covers both aspects: it is a monitoring solution, and it delivers alerts through its Alertmanager component. In this article, we saw how an alert is defined in the Prometheus configuration and how it reaches Alertmanager once it fires. Then, based on the Alertmanager definitions/integrations, we received an email with the details of the fired alert (the notification could also have been sent via Slack or PagerDuty).

