Prometheus VictoriaMetrics 单机与集群方案


VictoriaMetrics

一、VictoriaMetrics 简介

VictoriaMetrics 两个角色,1、时序数据库,2、监控解决方案

1. 优缺点

从时序数据库来上来看,它的性能足够优异

  • 查询性能:> InfluxDB、TimescaleDB(据说是 20 倍,仅当参考)
  • 内存使用:< InfluxDB、Prometheus、Thanos(少 5 倍以上)
  • 空间占用:< InfluxDB、Prometheus、Thanos(少 5 倍以上)

从监控解决方案来上来看,相较于 Prometheus 有以下优点

  • 提供全局(多实例)查询视图
  • 支持水平扩容、高可用
  • 支持多租户

由于 VictoriaMetrics 性能优于 Prometheus,且提供了更为丰富的功能,以及成熟的集群方案,所以现阶段很是火热

VictoriaMetrics 也不全是优点,上手后也发现一些不足

  • 图形化很简陋,很多页面直接就是 json 返回
  • 告警功能不如 AlertManager 丰富
  • 没有 WAL 日志,突然断电可能会丢失部分数据

2. 单机 & 集群

VictoriaMetrics 主要有两种使用场景

  • 单节点版:采集数据点(data points)低于 100w/s,官方推荐单节点版,单节点方案通常会使用到以下组件
    • victoria-metrics:负责提供时序数据的查询、存储
    • vmalert:负责从 victoria-metrics 获取数据评估 告警规则记录规则,产生的数据一般也会写到 victoria-metrics
    • alertmanager:负责处理 vmalert 发过来的告警
    • grafana:渲染指标数据,生成监控大盘
  • 集群版
    • vmagent:负责收集指标数据,支持提供 pull、push 获取数据、兼容 prometheus的 scraping target、relabeling 配置
    • vmselect:提供查询入口,执行用户输入查询,从各 vmstorage 节点获取所需数据
    • vminsert:提供 remote write 接口,接收 Prometheusvmagent 刮取的数据,并根据 指标名称 及其 标签一致性哈希 将其分散存储到 vmstorage 节点
    • vmstorage:存储原始数据,并返回指定标签过滤器在给定时间范围内的查询数据

二、VictoriaMetrics 单节点

单节点常用方案

方案一:Prometheus + AlertManager + Grafana + Webhook

当前架构

方案二:Prometheus + VictoriaMetrics + AlertManager + Grafana + Webhook

使用方式 1:Prometheus 刮取数据,然后通过 remote write 将数据发送 VictoriaMetrics 保存

方案三:VictoriaMetrics + AlertManager + Grafana + Webhook

使用方式 2:使用 VictoriaMetrics 刮取数据存储本地,使用 vmalert 评估 record rule、alert rule,生成持久化指标数据,以及告警规则判断

手动部署

从 方案一 调整为 方案三,在不影响基本功能的同时监控方案的性能与容量,为了加深理解,下面以二进制方式一步步的迁移部署

1. 安装 VictoriaMetrics、vmalert

VictoriaMetrics 二进制文件可以从 Github Release 下载

$ wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.90.0/victoria-metrics-linux-amd64-v1.90.0.tar.gz
$ tar xf victoria-metrics-linux-amd64-v1.90.0.tar.gz --transform 's/victoria-metrics-prod/victoria-metrics/'

这里除了 victoria-metrics 还会用到 vmalert,不过 官方文档Github 都没提供下载地址,按照官方文档执行 make 编译时遇到错误

$ git clone https://github.com/VictoriaMetrics/VictoriaMetrics
$ cd VictoriaMetrics
$ make vmalert

错误信息如下

APP_NAME=vmalert make app-local
make[1]: Entering directory `/tmp/VictoriaMetrics'
CGO_ENABLED=1 go build  -ldflags "-X 'github.com/VictoriaMetrics/VictoriaMetrics/lib/buildinfo.Version=vmalert-20230411-091838-heads-master-0-g0e1e0b0'" -o bin/vmalert github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert
app/vmalert/web.go:4:2: cannot find package "." in:
	/tmp/VictoriaMetrics/vendor/embed
vendor/github.com/mattn/go-runewidth/runewidth.go:7:2: found packages uniseg (doc.go) and main (gen_breaktest.go) in /tmp/VictoriaMetrics/vendor/github.com/rivo/uniseg
make[1]: *** [app-local] Error 1
make[1]: Leaving directory `/tmp/VictoriaMetrics'
make: *** [vmalert] Error 2

看了下 go.mod

$ cat go.mod                             
module github.com/VictoriaMetrics/VictoriaMetrics

go 1.19

require (
	cloud.google.com/go/storage v1.30.1
	github.com/Azure/azure-sdk-for-go/sdk/azcore v1.4.0
	github.com/Azure/azure-sdk-for-go/sdk/storage/azblob v1.0.0
	github.com/VictoriaMetrics/fastcache v1.12.1
    # ...

我这边版本太低了,升级 golang 版本

$ go version
go version go1.19.8 linux/amd64

切换代码版本,确保与 victoria-metrics 保持一致

☁  VictoriaMetrics [master]  git tag -l "v1.9*" 
v1.9.0
v1.90.0
v1.90.0-cluster
☁  VictoriaMetrics [v1.90.0] make vmalert             
APP_NAME=vmalert make app-local
make[1]: Entering directory `/tmp/VictoriaMetrics'
CGO_ENABLED=1 go build  -ldflags "-X 'github.com/VictoriaMetrics/VictoriaMetrics/lib/buildinfo.Version=vmalert-20230411-093246-v1.90.0-0-gb5d18c0'" -o bin/vmalert github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert
make[1]: Leaving directory `/tmp/VictoriaMetrics'

编译成功,目标文件在 bin/ 目录下

☁  VictoriaMetrics [v1.90.0] ls bin    
vmalert
☁  VictoriaMetrics [v1.90.0] ./bin/vmalert --version    
vmalert-20230411-093246-v1.90.0-0-gb5d18c0

顺手把其他的也一并编译出来

$ for i in vmagent vmauth vmctl vmgateway vmbackup vmrestore
do
make $i
done
$ ls bin 
vmagent  vmalert  vmauth  vmbackup  vmctl  vmrestore

编译完成,最后简单调整服务目录规范

$ tree /usr/local/victoria-metrics
/usr/local/victoria-metrics
├── bin
│   ├── victoria-metrics
│   ├── vmagent
│   ├── vmalert
│   ├── vmauth
│   ├── vmbackup
│   ├── vmctl
│   └── vmrestore
├── config
│   └── victoria-metrics.yml
├── data
└── rules

4 directories, 8 files

systemd 服务单元脚本,基于源代码目录下 package/rpm/victoriametrics.service 稍作修改

$ cat > /etc/systemd/system/vm.service << EOF
[Unit]
Description=VictoriaMetrics
After=network.target

[Service]
Type=simple
User=victoria-metrics
# 服务在单位时间内(StartLimitInterval)最大重启次数 
StartLimitBurst=5
StartLimitInterval=10
# 重启间隔,异常后等待 1 秒再启动
RestartSec=1
# 当退出码非 0 时,执行服务重启
Restart=on-failure
ExecStart=/usr/local/victoria-metrics/bin/victoria-metrics \
    -promscrape.config=/usr/local/victoria-metrics/config/victoria-metrics.yml \ # victoria-metrics 服务配置,基本兼容 Prometheus 配置
    -storageDataPath=/usr/local/victoria-metrics/data \ # 数据目录
    -promscrape.configCheckInterval=60s \               # 重载配置间隔
    -promscrape.consulSDCheckInterval=60s \             # 各类服务发现机制的检查间隔配置
    -promscrape.dnsSDCheckInterval=60s \
    -promscrape.dockerSDCheckInterval=60s \
    -promscrape.fileSDCheckInterval=60s \
    -promscrape.httpSDCheckInterval=60s \
    -promscrape.kubernetesSDCheckInterval=60s \
    -retentionPeriod=60d \                              # 保留最近 60 天的数据
    --httpListenAddr=:9290                              # 服务监听
ExecStop=/bin/kill -s SIGTERM $MAINPID


# 进程文件描述上限、服务最大可打开进程(线程)上限
LimitNOFILE=65536
LimitNPROC=32000


[Install]
WantedBy=multi-user.target
EOF

victoria-metrics 配置文件

$ cat > /usr/local/victoria-metrics/config/victoria-metrics.yml << EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "victoria-metrics"
    static_configs:
      - targets: ["127.0.0.1:9290"]
EOF

特别说明下,victoria-metrics 并非百分百兼容 prometheus 配置,如下面的错误提示,所以还是要自己多留意一下

Apr 10 17:56:28 bj-tencent-lhins-1 victoria-metrics[15020]: line 4: field evaluation_interval not found in type promscrape.GlobalConfig
Apr 10 17:56:28 bj-tencent-lhins-1 victoria-metrics[15020]: line 7: field alerting not found in type promscrape.Config
Apr 10 17:56:28 bj-tencent-lhins-1 victoria-metrics[15020]: line 12: field rule_files not found in type promscrape.Config
Apr 10 17:56:28 bj-tencent-lhins-1 victoria-metrics[15020]: line 67: field refresh_interval not found in type promscrape.FileSDConfig
Apr 10 17:56:28 bj-tencent-lhins-1 victoria-metrics[15020]: line 73: field refresh_interval not found in type dns.SDConfig
Apr 10 17:56:28 bj-tencent-lhins-1 victoria-metrics[15020]: line 79: field refresh_interval not found in type dns.SDConfig; pass -promscrape.config.strictParse=false command-line flag for ignoring unknown fields in yaml config

创建用户 victoria-metrics

$ useradd victoria-metrics

更改目录属主

$ chown -R victoria-metrics:victoria-metrics /usr/local/victoria-metrics

设置服务自启

$ sc enable vm --now

检查服务状态

$ sc status vm

victoria-metrics 配置完毕后,查看 WebUI

配置 vmalertsystemd 服务单元脚本

$ cat > /etc/systemd/system/vmalert.service << EOF
[Unit]
Description=VictoriaMetrics Alert
After=network.target

[Service]
Type=simple
User=victoria-metrics
StartLimitBurst=5
StartLimitInterval=10
RestartSec=1
Restart=on-failure
ExecStart=/usr/local/victoria-metrics/bin/vmalert\
  -evaluationInterval=15s \        # 评估间隔
  -rule=/usr/local/victoria-metrics/rules/*.yml \  # 记录规则、告警规则目录
  -datasource.url=127.0.0.1:9290 \                 # victoria-metrics 服务地址,读取数据评估规则
  -remoteWrite.url=127.0.0.1:9290 \                # victoria-metrics 服务地址,写入记录数据
  -notifier.url=127.0.0.1:9193 \                   # alertmanager 服务地址
  -httpListenAddr=0.0.0.0:9291                     # 服务监听地址

ExecStop=/bin/kill -s SIGTERM $MAINPID
LimitNOFILE=65536
LimitNPROC=32000

[Install]
WantedBy=multi-user.target
EOF

设置服务自启

$ sc enable vmalert --now

检查服务状态

$ sc status vmalert

查看 WebUI

2. 配置 AlertManager、Promoter

这两个服务是已经配置好的,所以不需要修改,vmalert 直接指过来就行,安装也比较简单,这里只是贴下相关配置

AlertManager 服务配置:alertmanager.yml

global:
  # 当 alertmanager 持续多长时间未接收到告警后标记告警状态为 resolved
  resolve_timeout: 5m
  ##################
  #      SMTP
  ##################
  smtp_smarthost: smtp.163.com:465
  smtp_from: xxx@163.com
  smtp_auth_username: xxxx@163.com
  smtp_auth_identity: xxxx@163.com
  smtp_auth_password: 
  smtp_require_tls: false


# 告警路由
route:
  # 这里的标签列表是接收到报警信息后的重新分组标签
  # 如,接收到的报警信息里有许多具有 instance=A 和 alertname=xx 这样标签的报警信息将会批量被聚合到一个分组里面
  group_by: ['instance', 'alertname']
  group_wait: 1s
  group_interval: 10s
  # 警报重复间隔,每2分钟重复一次警报
  repeat_interval: 2m
  # 警报接收端,这里配置为下面定义的钩子
  # receiver: 'default-mail-receiver'
  # receiver: 'ops_wechat'
  receiver: 'default-mail-receiver'
  routes:
  - match_re:
      # severity: ^(error|critical)$
      severity: ^(critical)$
    receiver: promoter-webhook-dingtalk
    continue: true
  - match:
      severity: warning
    receiver: promoter-webhook-wechat

receivers:
  - name: default-mail-receiver
    email_configs:
      - to: "xxx@163.com"
        send_resolved: true
  - name: 'promoter-webhook-dingtalk'
    webhook_configs:
    - url: "http://127.0.0.1:9195/dingtalk/send"
      send_resolved: true
  - name: 'promoter-webhook-wechat'
    webhook_configs:
    - url: "http://127.0.0.1:9195/wechat/send"
      send_resolved: true

Promoter 服务配置:promoter.yml

---
global:
  # victoria-metrics 服务地址,执行 PromQL 语句,用以渲染图片
  prometheus_url: http://172.17.0.1:9290
  dingtalk_api_token: xxx
  dingtalk_api_secret: xxx
  wechat_api_secret: xxx-DxXFQQVF8Z1eirmD8
  wechat_api_corp_id: xxx

s3:
  # 阿里云 AK、SK,用以上传监控渲染图片
  access_key: "xxx"
  secret_key: "xxx"
  # endpoint: "oss-cn-beijing-internal.aliyuncs.com"
  endpoint: "oss-cn-beijing.aliyuncs.com"
  region: "cn-beijing"
  bucket: "xxxx"


receivers:
  - name: dingtalk
    dingtalk_config:
      message_type: markdown
      markdown:
        title: '{{ template "dingtalk.default.title" . }}'
        text: '{{ template "dingtalk.default.content" . }}'
      at:
        atMobiles: [ "135xxx" ]
        isAtAll: true
  - name: wechat
    wechat_config:
      message_type: markdown
      message: '{{ template "wechat.default.message" . }}'
      to_user: "@all"
      agent_id: 1000002
3. Grafana 切换数据源

创建新数据源,使用 victoria-metrics 服务地址端口

切换数据源的方式如下:

  1. Share 分享,导出仪表盘 JSON 文件
  2. import 仪表盘,选择 victoria-metrics 数据源

由于 node_exporter 示例仪表盘,已经发布到 Grafana,直接输入 18435,修改可能存在的重名、UID问题,选择 victoria-metrics 数据源,然后倒入即可

查看新仪表盘的渲染展示情况

4. Promoter 告警测试

通过 HostHighTmpfsUsed 指标测试

- alert: HostHighTmpfsUsed
  # tmpfs 内存使用超过 1 GiB
  # expr: node:mem:tmpfs_used > 1024
  # 为方便测试调整阈值为 200
  expr: node:mem:tmpfs_used > 200
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} 节点 tmpfs 使用率过高 !"
    description: "最近一分钟内 {{ $labels.instance }} 节点 tmpfs 使用率过高 !\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"

/run 生成 300M 文件

$ cd /run; dd if=/dev/urandom of=testfile count=300 bs=1M

访问 vmalert WebUI 查看是否有活跃告警

等片刻,Pending 变为 Firing

企业微信告警

另一个相关指标的钉钉告警

ansible 部署

环境版本

由于只是示例 demo,没做太多兼容性考虑,低版本的 ansible 执行可能会出错,例如缺少 filter 之类的,建议使用前确认环境及版本,下面是我个人的环境

$ cat /etc/redhat-release           
CentOS Linux release 7.9.2009 (Core)

$ uname -r
5.4.239-1.el7.elrepo.x86_64

$ pip freeze | grep ansible                        
ansible==7.4.0
ansible-core==2.14.4


$ ansible --version
ansible [core 2.14.4]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /root/.pyenv/versions/3.9.16/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/.pyenv/versions/3.9.16/bin/ansible
  python version = 3.9.16 (main, Apr 13 2023, 00:38:03) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] (/root/.pyenv/versions/3.9.16/bin/python3.9)
  jinja version = 3.1.2
  libyaml = True
下载角色
$ ansible-galaxy install git+git@e.coding.net:LotusChing/prometheus/victoria-metrics-role.git

下载路径为配置文件 /etc/ansible/ansible.cfg 中 roles_path 参数定义

如果 roles_path 配置项未定义,则按照下载至默认 roles 路径列表中的第一个

$ ansible-config dump |grep DEFAULT_ROLES_PATH

DEFAULT_ROLES_PATH(default) = [u'/root/.ansible/roles', u'/usr/share/ansible/roles', u'/etc/ansible/roles']
$ cd /root/.ansible/roles
$ ls
victoria-metrics-role
修改配置
  • 根据个人环境修改 inventory 文件

  • 按照需求修改各软件参数配置 roles/single-server/vars/main.yml

执行角色
  1. 全部安装(vm、node_exporter、alertmanager、promoter、grafana)
    $ ansible-playbook -i inventory roles/single-server/setup.yml
  2. 只安装特定的包
    $ ansible-playbook -i inventory --tags=node_exporter,vm roles/single-server/setup.yml
  3. 排除特定的包
    $ ansible-playbook -i inventory --skip-tags=node_exporter roles/single-server/setup.yml

执行角色安装

$ ansible-playbook -i inventory roles/single-server/setup.yml

PLAY [localhost] ***********************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************
ok: [localhost]

TASK [../single-server : include_tasks] ************************************************************************************************************
included: /root/.ansible/roles/victoria-metrics-role/roles/single-server/tasks/install_node_exporter.yml for localhost

TASK [../single-server : 获取 node_exporter 二进制包文件状态] **************************************************************************************
ok: [localhost]

TASK [../single-server : 下载 node_exporter 二进制包] **********************************************************************************************
skipping: [localhost]

TASK [../single-server : 配置 Chrony NTP 开放规则] *************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建 node_exporter 服务目录] **********************************************************************************************
changed: [localhost]

TASK [../single-server : 解压 node_exporter 安装包] ************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 node_exporter 目录属主] **********************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 node_exporter 环境变量] **********************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 node_exporter 服务启动脚本] ******************************************************************************************
ok: [localhost]

TASK [../single-server : 重载系统配置 daemon_reload] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 启动 node_exporter 并配置自启] ********************************************************************************************
changed: [localhost]

TASK [../single-server : include_tasks] ************************************************************************************************************
included: /root/.ansible/roles/victoria-metrics-role/roles/single-server/tasks/install_promoter.yml for localhost

TASK [../single-server : 获取 promoter 二进制包文件状态] *******************************************************************************************
ok: [localhost]

TASK [../single-server : 下载 promoter 二进制包] ***************************************************************************************************
skipping: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建 promoter 服务目录] ***************************************************************************************************
changed: [localhost]

TASK [../single-server : 解压 promoter 安装包] *****************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 promoter 目录属主] ***************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 promoter 环境变量] ***************************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 promoter 服务启动脚本] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 promoter 服务配置] ***************************************************************************************************
changed: [localhost]

TASK [../single-server : 修正 promoter 服务配置] ***************************************************************************************************
changed: [localhost]

TASK [../single-server : 重载系统配置 daemon_reload] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 启动 promoter 并配置自启] *************************************************************************************************
changed: [localhost]

TASK [../single-server : include_tasks] ************************************************************************************************************
included: /root/.ansible/roles/victoria-metrics-role/roles/single-server/tasks/install_alertmanager.yml for localhost

TASK [../single-server : 获取 alertmanager 二进制包文件状态] ***************************************************************************************
ok: [localhost]

TASK [../single-server : 下载 alertmanager 二进制包] ***********************************************************************************************
skipping: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建 alertmanager 服务目录] ***********************************************************************************************
changed: [localhost]

TASK [../single-server : 解压 alertmanager 安装包] *************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 alertmanager 目录属主] ***********************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 alertmanager 环境变量] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 alertmanager 服务启动脚本] *******************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 alertmanager 服务配置] ***********************************************************************************************
changed: [localhost]

TASK [../single-server : 重载系统配置 daemon_reload] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 启动 alertmanager 并配置自启] *********************************************************************************************
changed: [localhost]

TASK [../single-server : include_tasks] ************************************************************************************************************
included: /root/.ansible/roles/victoria-metrics-role/roles/single-server/tasks/install_victoriametrics.yml for localhost

TASK [../single-server : 获取 VictoriaMetrics 二进制包文件状态] ************************************************************************************
ok: [localhost]

TASK [../single-server : 下载 victoria-metrics 二进制包] *******************************************************************************************
skipping: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建 victoria-metrics 服务目录] *******************************************************************************************
changed: [localhost]

TASK [../single-server : 解压 victoria-metrics 安装包] *********************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 victoria-metrics 环境变量] *******************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 victoria-metrics 服务启动脚本] ***************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 vmalert 服务脚本] ****************************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 victoria-metrics 服务配置] *******************************************************************************************
changed: [localhost]

TASK [../single-server : 拉取 rules 规则库] ********************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 victoria-metrics 目录属主] *******************************************************************************************
changed: [localhost]

TASK [../single-server : 重载系统配置 daemon_reload] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 启动 victoria-metrics 并配置自启] *****************************************************************************************
changed: [localhost]

TASK [../single-server : 启动 vmalert 并配置自启] **************************************************************************************************
changed: [localhost]

TASK [../single-server : include_tasks] ************************************************************************************************************
included: /root/.ansible/roles/victoria-metrics-role/roles/single-server/tasks/install_grafana.yml for localhost

TASK [../single-server : 获取 grafana 二进制包文件状态] ********************************************************************************************
ok: [localhost]

TASK [../single-server : 下载 grafana 二进制包] ****************************************************************************************************
skipping: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建服务运行用户组] *******************************************************************************************************
ok: [localhost]

TASK [../single-server : 创建 grafana 服务目录] ****************************************************************************************************
changed: [localhost]

TASK [../single-server : 解压 grafana 安装包] ******************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 grafana 目录属主] ****************************************************************************************************
changed: [localhost]

TASK [../single-server : 设置 grafana 环境变量] ****************************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 grafana 服务启动脚本] ************************************************************************************************
ok: [localhost]

TASK [../single-server : 生成 grafana 服务配置] ****************************************************************************************************
changed: [localhost]

TASK [../single-server : 重载系统配置 daemon_reload] ***********************************************************************************************
ok: [localhost]

TASK [../single-server : 启动 grafana 并配置自启] **************************************************************************************************
changed: [localhost]

RUNNING HANDLER [../single-server : print-node-exporter-info] **************************************************************************************
ok: [localhost] => {
    "msg": [
        "######### Node Exporter #########",
        "HOME: /usr/local/node_exporter",
        "URL: 192.168.0.101:9193",
        "#################################"
    ]
}

RUNNING HANDLER [../single-server : print-alertmanager-info] ***************************************************************************************
ok: [localhost] => {
    "msg": [
        "######### AlertManager #########",
        "HOME: /usr/local/alertmanager/",
        "DATA: /usr/local/alertmanager/data",
        "URL: 192.168.0.101:9193",
        "################################"
    ]
}

RUNNING HANDLER [../single-server : print-grafana-info] ********************************************************************************************
ok: [localhost] => {
    "msg": [
        "######### Grafana #########",
        "HOME: /usr/local/grafana",
        "Config: /usr/local/grafana/conf/defaults.ini",
        "User: admin",
        "Password: admin123",
        "URL: 192.168.0.101:9300",
        "###########################"
    ]
}

RUNNING HANDLER [../single-server : print-promoter-info] *******************************************************************************************
ok: [localhost] => {
    "msg": [
        "######### Promoter #########",
        "HOME: /usr/local/promoter",
        "Config: /usr/local/promoter/config.yml",
        "Listen addr: http://0.0.0.0:9194",
        "Default Media: dingtalk",
        "Enable Media: [dingtalk: True, wechat: False, mail: True]",
        "###########################"
    ]
}

RUNNING HANDLER [../single-server : print-victoria-metrics-info] ***********************************************************************************
ok: [localhost] => {
    "msg": [
        "######### VictoriaMetrics #########",
        "HOME: /usr/local/victoria-metrics",
        "DATA: /usr/local/victoria-metrics/data",
        "Config: /usr/local/victoria-metrics/config/victoria-metrics.yml",
        "VictoriaMetrics URL: 192.168.0.101:9290",
        "VMAlert URL: 192.168.0.101:9291",
        "Node_export target: ['127.0.0.1:9110']",
        "###################################"
    ]
}

PLAY RECAP *****************************************************************************************************************************************
localhost                  : ok=70   changed=27   unreachable=0    failed=0    skipped=5    rescued=0    ignored=0   
检查确认
  1. 检查服务启动,观察有无 error、failed 输出

    $ systemctl status <unit>
  2. 访问 VictoriaMetrics WebUI,检查 targets、以及 node_exporter 规则、记录规则是否生成等

    node_load1
    node:cpu:cpu_usage
  3. 访问 Grafana WebUI,默认用户名/密码 admin/admin123,创建数据源、导入 18434

  4. 测试 AlertManagerPromoter 告警

    AlertManager、Promoter 告警测试,观察 vmalert 有无产生 firing 告警

    $ sc stop node_exporter

​ OK,钉钉收到告警

三、VictoriaMetrics 集群

集群组件

VictoriaMetrics 集群模式主要由下面三个核心服务组成:

  • vmagent:负责收集指标数据,支持提供 pull、push 获取数据、兼容 prometheusscraping targetrelabeling 配置
  • vminsert:提供 remote write 接口,接收 Prometheusvmagent 刮取的数据,并根据 指标名称 及其 标签一致性哈希 将其分散存储到 vmstorage 节点
  • vmstorage:存储原始数据,并返回指定标签过滤器在给定时间范围内的查询数据
    • -storageDataPath 数据目录可用空间少于 -storage.minFreeDiskSpaceBytesvmstorage 节点进入只读模式,vminsert 节点将写请求路由到其他 vmstorage 节点
  • vmselect:执行查询,从各 vmstorage 节点获取所需数据

集群扩容

上门四个是 VictoriaMetrics 集群最核心的四个个组件,扩容以下组件可以提高集群性能级稳定性,具体如下:

组件 纵向扩容(CPU、内存) 横向扩容(添加节点)
vmagent 提高节点抓取性能 提高集群抓取性能及容量,将大量目标的抓取压力分散到多个 vmagent 实例
vmselect 提高复杂查询的性能(以及处理大量的时间序列和大量的原始样本) 提高集群稳定性,提高查询的最大速度,传入的并发请求可能会在更多的 vmselect 节点之间进行拆分
vminsert 通常不需要纵向扩容 提高集群稳定性,提高数据接收的最大速度,数据写入请求可以在更多的 vminsert 节点之间进行拆分
vmstorage 增加集群可以处理的活跃时间序列的数量 提高集群稳定性,提高对高流失率的时间序列的查询性能

数据复制

默认,VictoriaMetrics 数据复制依赖 -storageDataPath 指向的数据目录存储完成

除此外,还可以通过 -replicationFactor=N 启用多份写入,通过将每份数据存入 N 个不同的节点,实现数据复制。在查询时会同时查询多个节点,去重后返回给客户端

集群部署

申请 ecs

# 调动 阿里云 OpenAPI 创建三台抢占式 ECS 实例
$ python aliyun-ecs-sdk.py apply

Success. Instance creation succeed. InstanceIds: i-hp3dvstf3hxe3zfi5pi3, i-hp3dvstf3hxe3zfi5pi4, i-hp3dvstf3hxe3zfi5pi5
Instance boot successfully: node00001 39.104.25.70 172.16.0.14
Instance boot successfully: node00002 39.104.21.230 172.16.0.13
Instance boot successfully: node00003 39.104.21.216 172.16.0.12

初始化 k8s

使用 ansible 部署并初始化 Kubernetes 集群

$ ap -i alicloud.py --tags=kubernetes setup.yml

输出信息

PLAY [系统初始化] **********************************************************************************************************************************

PLAY [部署 Container Runtime] **********************************************************************************************************************

PLAY [部署 Kubernetes 集群] ************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************
ok: [i_hp3dvstf3hxe3zfi5pi5]
ok: [i_hp3dvstf3hxe3zfi5pi3]
ok: [i_hp3dvstf3hxe3zfi5pi4]

TASK [kubernetes : include_tasks] ******************************************************************************************************************
included: /prodata/scripts/ansibleLearn/ansible-k8s-role/roles/kubernetes/tasks/install_kubernetes.yml for i_hp3dvstf3hxe3zfi5pi3, i_hp3dvstf3hxe3zfi5pi4, i_hp3dvstf3hxe3zfi5pi5 => (item=install_kubernetes.yml)

TASK [kubernetes : 启动并设置 kubelet 自启] ********************************************************************************************************
ok: [i_hp3dvstf3hxe3zfi5pi3]
ok: [i_hp3dvstf3hxe3zfi5pi5]
ok: [i_hp3dvstf3hxe3zfi5pi4]

TASK [kubernetes : 生成 kubeadm 集群初始化配置] ****************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi4]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 拉取 Kubernetes 集群组件镜像] ***************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]
changed: [i_hp3dvstf3hxe3zfi5pi4]

TASK [kubernetes : 执行 Kubernetes 集群初始化] *****************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 获取 join 信息] *****************************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 拉取 join 脚本到主控机] *********************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : worker 节点加入集群] ************************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi3]
changed: [i_hp3dvstf3hxe3zfi5pi4]
changed: [i_hp3dvstf3hxe3zfi5pi5]

TASK [kubernetes : Master 节点基础配置] ************************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 下载 flannel 网络插件清单文件] **************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 部署 flannel 网络插件] **********************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 禁用默认 containerd cni 配置] ***************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi3]
fatal: [i_hp3dvstf3hxe3zfi5pi5]: FAILED! => {"changed": true, "cmd": "mv /etc/cni/net.d/10-containerd-net.conflist /etc/cni/net.d/10-containerd-net.conflist.bak ;\nifconfig cni0 down ;\nip link delete cni0\n", "delta": "0:00:00.126983", "end": "2023-04-14 15:08:55.157618", "msg": "non-zero return code", "rc": 1, "start": "2023-04-14 15:08:55.030635", "stderr": "cni0: ERROR while getting interface flags: No such device\nCannot find device \"cni0\"", "stderr_lines": ["cni0: ERROR while getting interface flags: No such device", "Cannot find device \"cni0\""], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [i_hp3dvstf3hxe3zfi5pi4]: FAILED! => {"changed": true, "cmd": "mv /etc/cni/net.d/10-containerd-net.conflist /etc/cni/net.d/10-containerd-net.conflist.bak ;\nifconfig cni0 down ;\nip link delete cni0\n", "delta": "0:00:00.130082", "end": "2023-04-14 15:08:55.495147", "msg": "non-zero return code", "rc": 1, "start": "2023-04-14 15:08:55.365065", "stderr": "cni0: ERROR while getting interface flags: No such device\nCannot find device \"cni0\"", "stderr_lines": ["cni0: ERROR while getting interface flags: No such device", "Cannot find device \"cni0\""], "stdout": "", "stdout_lines": []}
...ignoring

TASK [kubernetes : Flush handlers] *****************************************************************************************************************

TASK [kubernetes : Flush handlers] *****************************************************************************************************************

TASK [kubernetes : Flush handlers] *****************************************************************************************************************

RUNNING HANDLER [kubernetes : daemon-reload] *******************************************************************************************************
ok: [i_hp3dvstf3hxe3zfi5pi3]

RUNNING HANDLER [kubernetes : kubelet-restart] *****************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi3]

RUNNING HANDLER [kubernetes : containerd-restart] **************************************************************************************************

TASK [kubernetes : containerd-restart] *************************************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5]

RUNNING HANDLER [kubernetes : containerd-restart] **************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 重建 coredns 容器] **************************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi3]

TASK [kubernetes : 修复 scheduler control-manager 端口配置问题] ************************************************************************************
skipping: [i_hp3dvstf3hxe3zfi5pi4] => (item=/etc/kubernetes/manifests/kube-scheduler.yaml) 
skipping: [i_hp3dvstf3hxe3zfi5pi4] => (item=/etc/kubernetes/manifests/kube-controller-manager.yaml) 
skipping: [i_hp3dvstf3hxe3zfi5pi4]
skipping: [i_hp3dvstf3hxe3zfi5pi5] => (item=/etc/kubernetes/manifests/kube-scheduler.yaml) 
skipping: [i_hp3dvstf3hxe3zfi5pi5] => (item=/etc/kubernetes/manifests/kube-controller-manager.yaml) 
skipping: [i_hp3dvstf3hxe3zfi5pi5]
changed: [i_hp3dvstf3hxe3zfi5pi3] => (item=/etc/kubernetes/manifests/kube-scheduler.yaml)
changed: [i_hp3dvstf3hxe3zfi5pi3] => (item=/etc/kubernetes/manifests/kube-controller-manager.yaml)

RUNNING HANDLER [kubernetes : daemon-reload] *******************************************************************************************************
ok: [i_hp3dvstf3hxe3zfi5pi3]

RUNNING HANDLER [kubernetes : kubelet-restart] *****************************************************************************************************
changed: [i_hp3dvstf3hxe3zfi5pi3]

PLAY RECAP *****************************************************************************************************************************************
i_hp3dvstf3hxe3zfi5pi3     : ok=19   changed=14   unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
i_hp3dvstf3hxe3zfi5pi4     : ok=7    changed=4    unreachable=0    failed=0    skipped=8    rescued=0    ignored=1   
i_hp3dvstf3hxe3zfi5pi5     : ok=7    changed=4    unreachable=0    failed=0    skipped=8    rescued=0    ignored=1   

从 master 获取 kube-config 到本地

$ ansible -i alicloud.py 'i-hp3dvstf3hxe3zfi5pi3' -m fetch -a "src=/root/.kube/config dest=/root/.kube/config flat=true";sed -r -
i "s/server:.*/server: https:\/\/k8s-master001.yo-yo.fun:6443/g" ~/.kube/config

检查集群状态

$ kc get cs; kc get node; kc get pods -A; kc get --raw='/readyz?verbose'
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE                         ERROR
scheduler            Healthy   ok                              
controller-manager   Healthy   ok                              
etcd-0               Healthy   {"health":"true","reason":""}   

NAME           STATUS   ROLES                  AGE     VERSION
k8s-master01   Ready    control-plane,master   3m21s   v1.22.2
k8s-worker02   Ready    <none>                 2m56s   v1.22.2
k8s-worker03   Ready    <none>                 2m56s   v1.22.2


NAMESPACE      NAME                                   READY   STATUS    RESTARTS        AGE
kube-flannel   kube-flannel-ds-98f7m                  1/1     Running   0               2m52s
kube-flannel   kube-flannel-ds-bkl5d                  1/1     Running   0               2m52s
kube-flannel   kube-flannel-ds-k62v6                  1/1     Running   0               2m52s
kube-system    coredns-6548b55d4b-86ph9               1/1     Running   0               2m45s
kube-system    coredns-6548b55d4b-p8p7z               1/1     Running   0               2m45s
kube-system    etcd-k8s-master01                      1/1     Running   0               3m18s
kube-system    kube-apiserver-k8s-master01            1/1     Running   0               3m13s
kube-system    kube-controller-manager-k8s-master01   1/1     Running   2 (2m34s ago)   2m30s
kube-system    kube-proxy-6pg64                       1/1     Running   0               2m56s
kube-system    kube-proxy-9d796                       1/1     Running   0               3m3s
kube-system    kube-proxy-k8wv6                       1/1     Running   0               2m56s
kube-system    kube-scheduler-k8s-master01            1/1     Running   2 (2m34s ago)   2m30s


[+]ping ok
[+]log ok
[+]etcd ok
[+]informer-sync ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]shutdown ok
readyz check passed

OK,集群初始化完成

下列是接下来会用到的相关镜像,可以提前拉取下,避免后面再等着

ctr --namespace k8s.io images pull  docker.io/vicoriametrics/vmstorage:v1.77.0-cluster
ctr --namespace k8s.io images pull  docker.io/prom/prometheus:v2.35.0
ctr --namespace k8s.io images pull  docker.io/prom/node-exporter:v1.5.0
ctr --namespace k8s.io images pull  docker.io/grafana/grafana:9.4.7
ctr --namespace k8s.io images pull  docker.io/victoriametrics/vmselect:v1.77.0-cluster
ctr --namespace k8s.io images pull  docker.io/victoriametrics/vminsert:v1.77.0-cluster
ctr --namespace k8s.io images pull  docker.io/victoriametrics/vmagent:v1.77.0
ctr --namespace k8s.io images pull  docker.io/victoriametrics/vmalert:v1.77.0
ctr --namespace k8s.io images pull  docker.io/prom/alertmanager:v0.25.0
ctr --namespace k8s.io images pull  docker.io/lotusching/promoter:latest

namespace

资源定义

# ns.yml
apiVersion: v1
kind: Namespace
metadata:
  # 命名空间名称
  name: kube-vm

创建资源

$ kc apply -f ns.yml    
namespace/kube-vm created

rbac

资源定义

# rbac.yml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vmagent-sa
  namespace: kube-vm

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vmagent
rules:
  - apiGroups: ["", "networking.k8s.io", "extensions"]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - endpointslices
      - pods
      - app
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - namespaces
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/resources"]
    verbs: ["get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vmagent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: vmagent
subjects:
  - kind: ServiceAccount
    name: vmagent-sa
    namespace: kube-vm

创建资源,ServiceAccount 用以从 Kubernetes 自动发现监控目标,包括:Node、Pod、Service 等

$ kc apply -f rbac.yml    
serviceaccount/vmagent-sa created
clusterrole.rbac.authorization.k8s.io/vmagent created
clusterrolebinding.rbac.authorization.k8s.io/vmagent created

storageclass

资源定义

这里为了方面测试使用 LocalPath 作为本地存储

# storageclass.yml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
# 延迟绑定:等到第一个声明使用该 PVC 的 Pod 开始调度绑定
volumeBindingMode: WaitForFirstConsumer

创建资源

$ kc apply -f storageclass.yml    
storageclass.storage.k8s.io/local-storage created

node_exporter

资源定义

# node_exporter.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-vm
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.5.0
          args:
            - --web.listen-address=0.0.0.0:9110
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
            - --collector.filesystem.ignored-mount-points=^/(proc|var/lib/containerd/.+|/var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
            - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
            - --collector.textfile
            - --collector.netdev.device-exclude="^(lo|docker[0-9]|veth.+)$"
            - --collector.systemd
            - --collector.systemd.unit-whitelist="(docker|ssh).service"
            - --collector.conntrack
            - --collector.cpu
            - --collector.diskstats
            - --collector.filefd
            - --collector.filesystem
            - --collector.loadavg
            - --collector.meminfo
            - --collector.netdev
            - --collector.netstat
            - --collector.ntp
            - --collector.sockstat
            - --collector.stat
            - --collector.time
            - --collector.uname
            - --collector.vmstat
            - --collector.tcpstat
            - --collector.xfs
            - --collector.zfs
            - --no-collector.arp
            - --no-collector.bcache
            - --no-collector.bonding
            - --no-collector.buddyinfo
            - --no-collector.drbd
            - --no-collector.edac
            - --no-collector.entropy
            - --no-collector.hwmon
            - --no-collector.infiniband
            - --no-collector.interrupts
            - --no-collector.ipvs
            - --no-collector.ksmd
            - --no-collector.logind
            - --no-collector.mdadm
            - --no-collector.meminfo_numa
            - --no-collector.mountstats
            - --no-collector.nfs
            - --no-collector.nfsd
            - --no-collector.qdisc
            - --no-collector.runit
            - --no-collector.supervisord
            - --no-collector.timex
            - --no-collector.wifi
          ports:
            - containerPort: 9110
          env:
            - name: HOSTIP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          resources:
            requests:
              cpu: 150m
              memory: 180Mi
            limits:
              cpu: 150m
              memory: 180Mi
          securityContext:
            runAsUser: 65534
            runAsNonRoot: true
          volumeMounts:
            - mountPath: /host/proc
              name: proc
            - mountPath: /host/sys
              name: sys
            - mountPath: /host/root
              name: root
              readOnly: true
              mountPropagation: HostToContainer
            - mountPath: /var/run/dbus/system_bus_socket
              name: system-dbus-socket
              readOnly: true

      tolerations:
        - operator: "Exists"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
        - name: system-dbus-socket
          hostPath:
            path: /var/run/dbus/system_bus_socket

创建资源

$ kc apply -f node-exporter.yml 
daemonset.apps/node-exporter created

$ kc get ds -n kube-vm 
NAME    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
node-exporter   3    3    2    3    2    kubernetes.io/os=linux   29s

vmstore

唯一真正有状态的服务,这里使用 LocalPath 类型 PV,节点亲和性选的 k8s-worker02,数据目录在 /data/k8s/vmstore,各 vmstorage 写入数据时,会写到各自 vmstore/<POD_NAME>

$ mkdir -p /data/k8s/vmstore

资源定义

# vmstore.yml
---
apiVersion: v1
kind: Service
metadata:
  # headless 的 service 名称
  name: cluster-vmstorage
  namespace: kube-vm
  labels:
    app: vmstorage
spec:
  type: ClusterIP
  # headless 无头 Service
  clusterIP: None
  selector:
    app: vmstorage
  ports:
    - port: 8482
      targetPort: http
      name: http
    - port: 8401
      targetPort: vmselect
      name: vmselect
    - port: 8400
      targetPort: vminsert
      name: vminsert
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmstore-local-pv
  namespace: kube-vm
  labels:
    app: vmstorage
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 20Gi
  storageClassName: local-storage
  local:
    # 目录需要提前在 k8s-worker02 创建
    path: /data/k8s/vmstore
  persistentVolumeReclaimPolicy: Retain
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                # 节点亲和性选择 k8s-worker02
                - k8s-worker02
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmstore-local-pvc
  namespace: kube-vm
  labels:
    app: vmstorage
spec:
  storageClassName: local-storage
  selector:
    matchLabels:
      app: vmstorage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
  namespace: kube-vm
  labels:
    app: vmstorage
spec:
  serviceName: cluster-vmstorage
  selector:
    matchLabels:
      app: vmstorage
  replicas: 2
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: vmstorage
    spec:
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: vmstore-local-pvc
      containers:
        - name: vmstorage
          image: "victoriametrics/vmstorage:v1.77.0-cluster"
          imagePullPolicy: "IfNotPresent"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: storage
              mountPath: /storage
          args:
            - "--retentionPeriod=60d"
            # 不同实例接受的数据写入到不同目录
            - "--storageDataPath=/storage/$(POD_NAME)"
            - --envflag.enable=true
            - --envflag.prefix=VM_
            - --loggerFormat=json
            # 清理重复数据,与 scrape_interval 保持一致
            - --dedup.minScrapeInterval=15s
          ports:
            - name: http
              containerPort: 8482
            - name: vminsert
              containerPort: 8400
            - name: vmselect
              containerPort: 8401
          livenessProbe:
            failureThreshold: 10
            initialDelaySeconds: 30
            periodSeconds: 30
            tcpSocket:
              port: http
            timeoutSeconds: 5
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 5
            periodSeconds: 15
            timeoutSeconds: 5
            httpGet:
              path: /health
              port: http

创建资源

$ kc apply -f vmstore.yml     
service/cluster-vmstorage created
persistentvolume/vmstore-local-pv created
persistentvolumeclaim/vmstore-local-pvc created
statefulset.apps/vmstorage created

观察状态

$ kc get sts -n kube-vm 
NAME    READY   AGE
vmstorage   2/2     81s

$ kc -n kube-vm logs -l app=vmstorage
{"ts":"2023-04-14T03:01:50.018Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:832","msg":"nothing to load from \"/storage/vmstorage-0/cache/next_day_metric_ids\""}
{"ts":"2023-04-14T03:01:50.035Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-0/indexdb/1755ADF679474FBF\"..."}
{"ts":"2023-04-14T03:01:50.048Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-0/indexdb/1755ADF679474FBF\" has been opened in 0.012 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T03:01:50.049Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-0/indexdb/1755ADF679474FBE\"..."}
{"ts":"2023-04-14T03:01:50.064Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-0/indexdb/1755ADF679474FBE\" has been opened in 0.016 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T03:01:50.094Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:92","msg":"successfully opened storage \"/storage/vmstorage-0\" in 0.093 seconds; partsCount: 0; blocksCount: 0; rowsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T03:01:50.094Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:152","msg":"accepting vmselect conns at 0.0.0.0:8401"}
{"ts":"2023-04-14T03:01:50.094Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:85","msg":"accepting vminsert conns at 0.0.0.0:8400"}
{"ts":"2023-04-14T03:01:50.094Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:88","msg":"starting http server at http://127.0.0.1:8482/"}
{"ts":"2023-04-14T03:01:50.095Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:89","msg":"pprof handlers are exposed at http://127.0.0.1:8482/debug/pprof/"}
{"ts":"2023-04-14T03:02:05.011Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:832","msg":"nothing to load from \"/storage/vmstorage-1/cache/next_day_metric_ids\""}
{"ts":"2023-04-14T03:02:05.030Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C1\"..."}
{"ts":"2023-04-14T03:02:05.043Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C1\" has been opened in 0.013 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T03:02:05.045Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C0\"..."}
{"ts":"2023-04-14T03:02:05.069Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C0\" has been opened in 0.023 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T03:02:05.091Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:92","msg":"successfully opened storage \"/storage/vmstorage-1\" in 0.104 seconds; partsCount: 0; blocksCount: 0; rowsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T03:02:05.092Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:85","msg":"accepting vminsert conns at 0.0.0.0:8400"}
{"ts":"2023-04-14T03:02:05.092Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:88","msg":"starting http server at http://127.0.0.1:8482/"}
{"ts":"2023-04-14T03:02:05.092Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:152","msg":"accepting vmselect conns at 0.0.0.0:8401"}
{"ts":"2023-04-14T03:02:05.092Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:89","msg":"pprof handlers are exposed at http://127.0.0.1:8482/debug/pprof/"}

vmselect

vmselect 负责提供数据查询接口,例如给 Grafana 查询渲染看板数据

vmselect 基本可以看做是无状态的,不过它本身提供了 cache 功能,这部分是带点状态的,但假如可以接受缓存丢失,数据查询回源的话,那么可以直接 deployment 部署,如果想要保存下来,也可以使用 statufulset 挂个 pv、pvc 进行部署

vmselect 服务最重要的参数:--storageNode=,通过该参数指定所有的 vmstorage 节点地址,由于 vmstorage 是用 StatefulSet 部署的,Pod 名称是固定的,所以这里使用的是 FQDN 形式访问 vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8401

资源定义

---
apiVersion: v1
kind: Service
metadata:
  name: vmselect
  namespace: kube-vm
  labels:
    app: vmselect
spec:
  type: NodePort
  selector:
    app: vmselect
  ports:
    - name: http
      port: 8481
      targetPort: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmselect
  namespace: kube-vm
  labels:
    app: vmselect
spec:
  selector:
    matchLabels:
      app: vmselect
  template:
    metadata:
      labels:
        app: vmselect
    spec:
      volumes:
        - name: cache-volume
          emptyDir: {}
      containers:
        - name: vmselect
          image: "victoriametrics/vmselect:v1.77.0-cluster"
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: cache-volume
              mountPath: /cache
          args:
            - "--cacheDataPath=/cache"
            # 逐一显式指明 vmstorage 节点地址
            - --storageNode=vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8401
            - --storageNode=vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8401
            - --envflag.enable=true
            - --envflag.prefix=VM_
            - --loggerFormat=json
            # 清理重复数据,与 scrape_interval 保持一致
            - --dedup.minScrapeInterval=15s
          ports:
            - name: http
              containerPort: 8481
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            tcpSocket:
              port: http
            initialDelaySeconds: 5
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

创建资源

$ kc apply -f vmselect.yml    
service/vmselect created
deployment.apps/vmselect created

观察状态

$ kc get deploy -n kube-vm -l app=vmselect
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
vmselect   1/1     1    1    24s

$ kc -n kube-vm logs -l app=vmselect
{"ts":"2023-04-14T03:04:52.146Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.prefix\"=\"VM_\""}
{"ts":"2023-04-14T03:04:52.146Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"loggerFormat\"=\"json\""}
{"ts":"2023-04-14T03:04:52.146Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"storageNode\"=\"vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8401,vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8401\""}
{"ts":"2023-04-14T03:04:52.146Z","level":"info","caller":"VictoriaMetrics/app/vmselect/main.go:74","msg":"starting netstorage at storageNodes [vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8401 vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8401]"}
{"ts":"2023-04-14T03:04:52.147Z","level":"info","caller":"VictoriaMetrics/app/vmselect/main.go:81","msg":"started netstorage in 0.000 seconds"}
{"ts":"2023-04-14T03:04:52.151Z","level":"info","caller":"VictoriaMetrics/lib/memory/memory.go:42","msg":"limiting caches to 5010650726 bytes, leaving 3340433818 bytes to the OS according to -memory.allowedPercent=60"}
{"ts":"2023-04-14T03:04:52.151Z","level":"info","caller":"VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:63","msg":"loading rollupResult cache from \"/cache/rollupResult\"..."}
{"ts":"2023-04-14T03:04:52.152Z","level":"info","caller":"VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:89","msg":"loaded rollupResult cache from \"/cache/rollupResult\" in 0.002 seconds; entriesCount: 0, sizeBytes: 0"}
{"ts":"2023-04-14T03:04:52.153Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:88","msg":"starting http server at http://127.0.0.1:8481/"}
{"ts":"2023-04-14T03:04:52.154Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:89","msg":"pprof handlers are exposed at http://127.0.0.1:8481/debug/pprof/"}

vminsert

vminsert 主要负责提供 remote write 接口,接收来自 vmagent、vmalert、prometheus 采集(生成)的数据,根据标签的一致性哈希,将数据分散存储到 vmstorage 节点,它是无状态的,所以当我们想要提高数据接收最大速度的时候,可以很简便的进行横向扩容

资源定义

---
apiVersion: v1
kind: Service
metadata:
  name: vminsert
  namespace: kube-vm
  labels:
    app: vminsert
spec:
  type: ClusterIP
  selector:
    app: vminsert
  ports:
    - name: http
      port: 8480
      targetPort: http

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vminsert
  namespace: kube-vm
  labels:
    app: vminsert
spec:
  selector:
    matchLabels:
      app: vminsert
  template:
    metadata:
      labels:
        app: vminsert
    spec:
      containers:
        - name: vminsert
          image: "victoriametrics/vminsert:v1.77.0-cluster"
          imagePullPolicy: "IfNotPresent"
          args:
            # 与 vmselect 一样,逐一指明 vmstore Pod 地址
            - --storageNode=vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8400
            - --storageNode=vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8400
            - --envflag.enable=true
            - --envflag.prefix=VM_
            - --loggerFormat=json
          ports:
            - name: http
              containerPort: 8480
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          livenessProbe:
            tcpSocket:
              port: http
            initialDelaySeconds: 5
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

创建资源

$ kc apply -f vminsert.yml    
service/vminsert created
deployment.apps/vminsert created

观察状态

$ kc get deploy -n kube-vm -l app=vminsert
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
vminsert   1/1     1    1    16s

$ kc -n kube-vm logs -l app=vminsert    
{"ts":"2023-04-14T03:08:26.447Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.prefix\"=\"VM_\""}
{"ts":"2023-04-14T03:08:26.447Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"loggerFormat\"=\"json\""}
{"ts":"2023-04-14T03:08:26.447Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"storageNode\"=\"vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8400,vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8400\""}
{"ts":"2023-04-14T03:08:26.447Z","level":"info","caller":"VictoriaMetrics/app/vminsert/main.go:78","msg":"initializing netstorage for storageNodes [vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8400 vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8400]..."}
{"ts":"2023-04-14T03:08:26.447Z","level":"info","caller":"VictoriaMetrics/lib/memory/memory.go:42","msg":"limiting caches to 5010650726 bytes, leaving 3340433818 bytes to the OS according to -memory.allowedPercent=60"}
{"ts":"2023-04-14T03:08:26.447Z","level":"info","caller":"VictoriaMetrics/app/vminsert/main.go:91","msg":"successfully initialized netstorage in 0.001 seconds"}
{"ts":"2023-04-14T03:08:26.448Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:88","msg":"starting http server at http://127.0.0.1:8480/"}
{"ts":"2023-04-14T03:08:26.448Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:89","msg":"pprof handlers are exposed at http://127.0.0.1:8480/debug/pprof/"}
{"ts":"2023-04-14T03:08:26.653Z","level":"info","caller":"VictoriaMetrics/app/vminsert/netstorage/netstorage.go:260","msg":"successfully dialed -storageNode=\"vmstorage-0.cluster-vmstorage.kube-vm.svc.cluster.local:8400\""}
{"ts":"2023-04-14T03:08:26.654Z","level":"info","caller":"VictoriaMetrics/app/vminsert/netstorage/netstorage.go:260","msg":"successfully dialed -storageNode=\"vmstorage-1.cluster-vmstorage.kube-vm.svc.cluster.local:8400\""}

vmagent

vmagent 取代 prometheus,负责自动发现、采集指标数据(不包含 record 持久化指标),vmagent 稍微带一点状态,这是一位 vmagent 提供了一个提高健壮性的参数 -remoteWrite.tmpDataPath,参数作用是 当 vminsert 不可用时(无可用实例),暂时先写到本地,待 vminsert 恢复可用后,再逐步将本地同步到 vminsert

所以,这里要用 sts 部署,并挂上一个小点的 pv、pvc

资源定义 configMap,仅包含 vmagent 服务本身配置、自动发现配置,不包含 持久化规则、告警规则

# configmap-vmagent-config.yml
apiVersion: v1
kind: ConfigMap
metadata:
  # name: vmagent-config
  name: configmap-vmagent-config
  namespace: kube-vm
data:
  scrape.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s

    scrape_configs:
    - job_name: nodes
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
      # 修改默认 10250 端口采集为 自定义的 node_exporter 9110 端口
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9110"
        target_label: __address__
        action: replace
      # # 映射 Node 的 Label 标签
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: apiserver
      scheme: https
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        # 根据正则过滤出 apiserver 服务组件的 endpoint
        regex: apiserver
        source_labels: [__meta_kubernetes_service_label_component]

    - job_name: cadvisor
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # 跳过证书校验
        insecure_skip_verify: true
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: /metrics/cadvisor
        target_label: __metrics_path__

    - job_name: pod
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: drop
        regex: true
        source_labels:
        - __meta_kubernetes_pod_container_init
      # 用以判断注解提供的端口,是否匹配容器暴露的端口,避免编写时端口不一致,导致采集失败
      - action: keep_if_equal
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_port
        - __meta_kubernetes_pod_container_port_number
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      # 匹配 http、https 格式的协议头,替换到 __scheme__ 标签,用以抓取数据时使用正确的协议
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      # 匹配路径,替换到 __scheme__ 标签,用以抓取数据时使用正确的路径
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      # 通过注解获取地址、端口
      - action: replace
        # ([^:]+) 非:开头出现一到多次,匹配 IP 地址
        # (?::\d+)? 不保存子组,:\d+,匹配 :port 出现 0 到 1次
        # (\d+) 端口
        regex: ([^:]+)(?::\d+)?;(\d+)
        # 根据匹配分组生成新数据
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        # 使用包含新地址数据的 __address__ 标签采集数据
        target_label: __address__
      # 标签映射,将符合规则的多个标签,统一映射出来
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      # 生成包含 pod 名称的 pod 标签,用以标识数据属于那个 pod
      - source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod
      # 生成包含命名空间名称的 namespace 标签,用以标识数据属于那个 命名空间
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      # 生成包含服务名称的 service 标签,用以标识数据属于那个服务
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: service
      - replacement: ${1}
        source_labels:
        - __meta_kubernetes_service_name
        target_label: job
      # 生成包含节点名称的 node 标签,用以标识数据属于那个 宿主机节点
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: node

vmagent 使用的是 LocalPath ,所以这里提前在 k8s-worker03 创建目录

$ mkdir -p /data/k8s/vmagent

资源定义

# vmagent.yml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmagent-pv
  namespace: kube-vm
  labels:
    app: vmagent
spec:
  # 动态存储类名称
  storageClassName: local-storage
  # 删除 PVC 的时候,PV和 不会被删除,需要手动删除
  persistentVolumeReclaimPolicy: Retain
  # 配置 PV 支持的访问模式
  accessModes:
    # 允许多个节点挂载
    - ReadWriteMany
  capacity:
    # 提供的容量
    storage: 2Gi
  # 宿主机路径
  local:
    path: /data/k8s/vmagent
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["k8s-worker03"]
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmagent-pvc
  namespace: kube-vm
  labels:
    app: vmagent
spec:
  storageClassName: local-storage
  selector:
    matchLabels:
      app: vmagent
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi


---
apiVersion: v1
kind: Service
metadata:
  name: vmagent
  namespace: kube-vm
  annotations:
    # 设置注解,用以自动发现并采集自身指标
    prometheus.io/scrape: "true"
    prometheus.io/port: "8429"
spec:
  selector:
    app: vmagent
  # 无头服务,用以发现 vmagent 实例
  clusterIP: None
  ports:
    - name: http
      port: 8429
      targetPort: http
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmagent
  namespace: kube-vm
  labels:
    app: vmagent
spec:
  replicas: 2
  serviceName: vmagent
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      serviceAccountName: vmagent-sa
      volumes:
        - name: config
          configMap:
            name: configmap-vmagent-config
        - name: tmpdata
          persistentVolumeClaim:
            claimName: vmagent-pvc
      containers:
        - name: agent
          image: victoriametrics/vmagent:v1.77.0
          imagePullPolicy: IfNotPresent
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: tmpdata
              mountPath: /tmpData
            - name: config
              mountPath: /config
          args:
            - -promscrape.config=/config/scrape.yml
            # 当后端 insert 节点不可用时,临时写到该目录,insert 恢复可用后,再继续同步
            # 由于使用的是 LocalPath 所以这里用 $(POD_NAME) 分开一下
            - -remoteWrite.tmpDataPath=/tmpData/$(POD_NAME)
            # vmagent 实例的数量
            - -promscrape.cluster.membersCount=2
            # - -promscrape.cluster.replicationFactor=2 # 可以配置副本数
            # vmagent 实例成员 ID
            - -promscrape.cluster.memberNum=$(POD_NAME)
            - -remoteWrite.url=http://vminsert:8480/insert/0/prometheus
            # 允许提供环境变量设置参数
            - -envflag.enable=true
            # 环境变量前缀设置
            - -envflag.prefix=VM_
            - -loggerFormat=json
          ports:
            - name: http
              containerPort: 8429

创建资源

$ kc apply -f configmap-vmagent-config.yml    
configmap/configmap-vmagent-config created

$ kc apply -f vmagent.yml    
persistentvolume/vmagent-pv created
persistentvolumeclaim/vmagent-pvc created
service/vmagent created
statefulset.apps/vmagent created

观察状态

$ kc get sts -n kube-vm -l app=vmagent
NAME    READY   AGE
vmagent   2/2     38s

$
{"ts":"2023-04-14T03:22:12.118Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:589","msg":"reloaded 7 objects from \"https://10.96.0.1:443/api/v1/endpoints\" in 0.015s; updated=0, removed=0, added=7, resourceVersion=\"3339\""}
{"ts":"2023-04-14T03:22:12.119Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:589","msg":"reloaded 7 objects from \"https://10.96.0.1:443/api/v1/services\" in 0.016s; updated=0, removed=0, added=7, resourceVersion=\"3339\""}
{"ts":"2023-04-14T03:22:12.120Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:589","msg":"reloaded 3 objects from \"https://10.96.0.1:443/api/v1/nodes\" in 0.016s; updated=0, removed=0, added=3, resourceVersion=\"3339\""}
{"ts":"2023-04-14T03:22:12.120Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:589","msg":"reloaded 22 objects from \"https://10.96.0.1:443/api/v1/pods\" in 0.017s; updated=0, removed=0, added=22, resourceVersion=\"3339\""}
{"ts":"2023-04-14T03:22:41.053Z","level":"info","caller":"VictoriaMetrics/lib/promscrape/scraper.go:393","msg":"kubernetes_sd_configs: added targets: 4, removed targets: 0; total targets: 4"}

配置完负责采集指标的 vmagent 后,通过 vmselect 提供 vmui 试着查询下指标数据

获取 vmselect 外部端口

$ kc get svc -n kube-vm

NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
vmselect   NodePort   10.103.106.255   <none>        8481:32515/TCP   21m

访问 WebUI(http://ip:32515/select/0/vmui),查询数据

同时,也可以到 亲和性节点 k8s-worker03 上查看数据目录

$ hostname
k8s-worker03

$ pwd
/data/k8s

$ tree vmagent 
vmagent
├── vmagent-0
│   └── persistent-queue
│       └── 1_F7AB42DA6C66E8E1
│      ├── 0000000000000000
│      └── metainfo.json
└── vmagent-1
    └── persistent-queue
    └── 1_F7AB42DA6C66E8E1
    ├── 0000000000000000
    └── metainfo.json

6 directories, 4 files

除此外,还可以观察下 vmstorage 的数据存储,这部分数据是存在 k8s-worker02 节点

$ tree vmstore -L 3
vmstore
├── vmstorage-0
│   ├── data
│   │   ├── big
│   │   ├── flock.lock
│   │   └── small
│   ├── flock.lock
│   ├── indexdb
│   │   ├── 1755ADF679474FBE
│   │   ├── 1755ADF679474FBF
│   │   └── snapshots
│   ├── metadata
│   │   └── minTimestampForCompositeIndex
│   └── snapshots
└── vmstorage-1
    ├── data
    │   ├── big
    │   ├── flock.lock
    │   └── small
    ├── flock.lock
    ├── indexdb
    │   ├── 1755ADF9F68785C0
    │   ├── 1755ADF9F68785C1
    │   └── snapshots
    ├── metadata
    │   └── minTimestampForCompositeIndex
    └── snapshots

20 directories, 6 files

OK,确实数据也过来了,按照 Pod 名称分别写到不同目录去了

alertmanager

资源定义 ConfigMap

这里仅做了基本配置,告警媒介走的是 一个开源的 webhook 服务 promoter

# configmap-alertmanager-config.yml

apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-alertmanager-config
  namespace: kube-vm
data:
  alertmanager.yml: |
    global:
      # 当 alertmanager 持续多长时间未接收到告警后标记告警状态为 resolved
      resolve_timeout: 5m
    # 告警路由
    route:
      # 这里的标签列表是接收到报警信息后的重新分组标签
      # 如,接收到的报警信息里有许多具有 instance=A 和 alertname=xx 这样标签的报警信息将会批量被聚合到一个分组里面
      group_by: ['instance', 'alertname']
      group_wait: 1s
      group_interval: 10s
      # 警报重复间隔,每2分钟重复一次警报
      repeat_interval: 2m
      # 警报接收端,这里配置为下面定义的钩子
      receiver: 'promoter-webhook-wechat'
      routes:
      - match_re:
          # severity: ^(error|critical)$
          severity: ^(critical)$
        receiver: promoter-webhook-dingtalk
        continue: true
    receivers:
      - name: 'promoter-webhook-dingtalk'
        webhook_configs:
        - url: "http://promoter:9194/dingtalk/send"
          send_resolved: true
      - name: 'promoter-webhook-wechat'
        webhook_configs:
        - url: "http://promoter:9194/wechat/send"
          send_resolved: true

资源定义

apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-vm
  labels:
    app: alertmanager
spec:
  selector:
    app: alertmanager
  type: ClusterIP
  ports:
    - port: 9193
      targetPort: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-vm
  labels:
    app: alertmanager
spec:
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      volumes:
        - name: alertmanager-config
          configMap:
            name: configmap-alertmanager-config
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.25.0
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/etc/alertmanager/alertmanager.yml"
          ports:
            - containerPort: 9093
              name: http
          volumeMounts:
            - mountPath: "/etc/alertmanager"
              name: alertmanager-config
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 100m
              memory: 256Mi

创建资源

$ kc apply -f configmap-alertmanager-config.yml
$ kc apply -f alertmanager.yml

$ kc -n kube-vm get deploy -l app=alertmanager
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
alertmanager   1/1     1    1    3m48s

promoter

promoter 配置文件 promoter-config.yml

---
global:
  dingtalk_api_token: xxx
  dingtalk_api_secret: xxx
  wechat_api_secret: xxx-xxx
  wechat_api_corp_id: xxx
s3:
  # 阿里云 OSS,用以保存生成的图片
  access_key: "xxx"
  secret_key: "xxx"
  # endpoint: "oss-cn-beijing-internal.aliyuncs.com"
  endpoint: "oss-cn-beijing.aliyuncs.com"
  region: "cn-beijing"
  bucket: "xxx"

receivers:
  - name: dingtalk
    dingtalk_config:
      message_type: markdown
      markdown:
        title: '{{ template "dingtalk.default.title" . }}'
        text: '{{ template "dingtalk.default.content" . }}'
      at:
        atMobiles: [ "138xxxx" ]
        isAtAll: true
  - name: wechat
    wechat_config:
      message_type: markdown
      message: '{{ template "wechat.default.message" . }}'
      to_user: "@all"
      agent_id: 1000002

生成 secret 密文 data

$ cat promoter-config.yml | base64

资源定义 Secret

# secret-promoter-config.yml
apiVersion: v1
kind: Secret
metadata:
  name: secret-promoter-config
  namespace: kube-vm
data:
  config.yml: |
    # 密文 data

创建 secret 对象

$ kc apply -f secret-promoter-config.yml

promoter 工作负载定义

apiVersion: v1
kind: Service
metadata:
  name: promoter
  namespace: kube-vm
  labels:
    app: promoter
spec:
  type: ClusterIP
  selector:
    app: promoter
  ports:
    - port: 9194
      protocol: TCP
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promoter
  namespace: kube-vm
  labels:
    app: promoter
spec:
  selector:
    matchLabels:
      app: promoter
  template:
    metadata:
      labels:
        app: promoter
    spec:
      volumes:
        - name: promoter-config
          secret:
            secretName: secret-promoter-config
      containers:
        - name: promoter
          image: lotusching/promoter:latest
          imagePullPolicy: IfNotPresent
          command:
            - "/promoter/bin/promoter"
            - "--config.file=/etc/secret/config.yml"
          volumeMounts:
            - mountPath: /etc/secret
              name: promoter-config
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP

创建资源

$ kc apply -f promoter.yml
service/promoter created
deployment.apps/promoter created

$ kc get deploy -n kube-vm -l app=promoter    
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
promoter   1/1     1    1    7s


$ kc -n kube-vm logs -f -l app=promoter
ts=2023-04-14T03:39:25.733Z caller=main.go:58 level=info msg="Staring Promoter" version="(version=0.2.3, branch=HEAD, revision=0a9cf8fc9bd55d1d2d47d181867135914927c2fc)"
ts=2023-04-14T03:39:25.733Z caller=main.go:59 level=info build_context="(go=go1.17.8, user=root@91adc4eacff7, date=20220305-05:40:54)"
ts=2023-04-14T03:39:25.733Z caller=main.go:127 level=info component=configuration msg="Loading configuration file" file=/etc/secret/config.yml
ts=2023-04-14T03:39:25.733Z caller=main.go:138 level=info component=configuration msg="Completed loading of configuration file" file=/etc/secret/config.yml
ts=2023-04-14T03:39:25.735Z caller=main.go:88 level=info msg=Listening address=:8080

vmalert

vmalert 负责从 vmselect 查询数据,根据已载入的规则(记录规则、告警规则)进行数据评估,评估后,持久化记录这部分数据通过 vminsert 存入 vmstorage 组件

告警规则,如哟评估满足告警条件,则将通过 alertmanager 产生告警通知,alertmanager 根据告警策略配置,选择分组、抑制、以及路由到配置定义的媒介,经由 webhook 发送通知到用户

vmalert 组件重要的参数有以下几个

  • -rule:规则文件路径
  • -datasource.url:从哪里查询数据,用以生成持久化记录数据,也就是 vmselect 的地址
  • -remoteWrite.url:生成持久化记录数据后,写到哪,也就是 vminsert 的地址
  • -notifier.url:告警规则触发后通知谁,也就是 alertmanager 的地址
  • -evaluationInterval=15s:规则(持久化、告警)多久评估一次

vmalert ConfigMap 配置对象

apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-vmalert-rules
  namespace: kube-vm
data:
  node_records.yml: |+
    groups:
      - name: "node_rules"
        interval: 15s
        rules:
          #################
          #  CPU
          #################
          # 最近 1 分钟节点 CPU 使用率
          - record: node:cpu:cpu_usage
            expr: (1 - sum(irate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(irate(node_cpu_seconds_total[1m])) by (instance) )
          # 最近 1 分钟节点各 CPU 核心使用率
          - record: node:cpu:per_cpu_usage
            expr: (1 - sum(irate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance, cpu)  / sum(irate(node_cpu_seconds_total[1m])) by (instance, cpu))
          #################
          #  Memory
          #################
          # 节点 内存 使用率
          - record: node:mem:memory_usage
            expr: (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes)
          # tmpfs、devtmpfs 内存使用量(单位 MiB)
          - record: node:mem:tmpfs_used
            expr: (node_filesystem_size_bytes{fstype=~".*tmpfs"} - node_filesystem_free_bytes{fstype=~".*tmpfs"}) / 1024 / 1024
          # 最近一分钟内 slab 不可回收内存量的平均值(单位 MiB)
          - record: node:mem:slab_sunreclaim
            expr: avg_over_time(node_memory_SUnreclaim_bytes[1m]) / 1024 / 1024
          # 最近一分钟内 LRU list 中 不可释放内存量的平均值(单位 MiB)
          - record: node:mem:lru_unevictable
            expr: avg_over_time(node_memory_Unevictable_bytes[1m]) / 1024 / 1024
          #################
          #  Disk
          #################
          # 空间 已用百分比
          - record: node:disk:disk_space_usage
            expr: (1 - (node_filesystem_avail_bytes{fstype=~"ext.*|xfs|btrfs",device=~"/dev/vd.*"} / node_filesystem_size_bytes{fstype=~"ext.*|xfs|btrfs",device=~"/dev/vd.*"}))
          # Inode 已用百分比
          - record: node:disk:inode_space_usage
            expr: (1 - (node_filesystem_files_free{fstype="ext4"} / node_filesystem_files{fstype="ext4"}))
          #################
          #  DiskIO
          #################
          # 计算 1 分钟内平均每秒处理磁盘读请求数,对应 iostat -dxk 中的 r/s
          - record: node:disk:read_iops
            expr: sum by (instance) (rate(node_disk_reads_completed_total{device=~"vd.*"}[1m]))
          # 计算 1 分钟内平均每秒处理磁盘写请求数,对应 iostat -dxk 中的 w/s
          - record: node:disk:write_iops
            expr: sum by (instance) (rate(node_disk_writes_completed_total{device=~"vd.*"}[1m]))
          # 计算 1 分钟内平均每秒处理磁盘读带宽,对应 iostat -dxk 中的 rkB/s
          - record: node:disk:read_bandwidth
            expr: sum by (instance) (irate(node_disk_read_bytes_total{device=~"vd.*"}[1m]))
          # 计算 1 分钟内平均每秒处理磁盘写带宽,对应 iostat -dxk 中的 wkB/s
          - record: node:disk:write_bandwidth
            expr: sum by (instance) (irate(node_disk_written_bytes_total{device=~"vd.*"}[1m]))
          # 计算 1 分钟内平均读请求延迟 ms,对应 iostat -dxk 中的 r_await
          - record: node:disk:read_await
            expr: sum by (instance) (rate(node_disk_read_time_seconds_total{device=~"vd.*"}[1m]) / rate(node_disk_reads_completed_total{device=~"vd.*"}[1m]) * 1000)
          # 计算 1 分钟内平均写请求延迟,对应 iostat -dxk 中的 w_await
          - record: node:disk:write_await
            expr: sum by (instance) (rate(node_disk_write_time_seconds_total{device=~"vd.*"}[1m]) / rate(node_disk_writes_completed_total{device=~"vd.*"}[1m]) * 1000)
          #################
          #  File Descriptor
          #################
          # 系统已用文件描述符百分比
          - record: node:proc:os_fd_usage
            expr: (node_filefd_allocated / node_filefd_maximum)
          # 进程已用文件描述符百分比
          - record: node:proc:proc_fd_usage
            expr: (process_open_fds{job="node"} / process_max_fds{job="node"})
          #################
          #  Network
          #################
          # 各实例、各网卡 1 分钟内平均每秒接收字节数
          - record: node:net:network_rx
            expr: sum by(instance, device) (irate(node_network_receive_bytes_total{device=~"eth.*"}[1m]))
          # 各实例、各网卡 1 分钟内平均每秒发送字节数
          - record: node:net:network_tx
            expr: sum by(instance, device) (irate(node_network_transmit_bytes_total{device=~"eth.*"}[1m]))
          #################
          #  TCP
          #################
          # 各实例、各网卡 5 分钟内入向报文错误包占比(平均每秒)
          - record: node:tcp:rx_error_rate5m
            expr: sum by(instance, device) (rate(node_network_receive_errs_total{device=~"eth.*"}[5m]) / rate(node_network_receive_packets_total{device=~"eth.*"}[5m]))
          # 各实例、各网卡 5 分钟内出向报文错误包占比(平均每秒)
          - record: node:tcp:tx_error_rate5m
            expr: sum by(instance, device) (rate(node_network_transmit_errs_total{device=~"eth.*"}[5m]) / rate(node_network_transmit_packets_total{device=~"eth.*"}[5m]))
          # 各实例、各网卡 5 分钟内入向报文丢弃包占比(平均每秒)
          - record: node:tcp:rx_drop_rate5m
            expr: sum by(instance, device) (rate(node_network_receive_drop_total{device=~"eth.*"}[5m]) / rate(node_network_receive_packets_total{device=~"eth.*"}[5m]))
          # 各实例、各网卡 5 分钟内出向报文丢弃包占比(平均每秒)
          - record: node:tcp:tx_drop_rate5m
            expr: sum by(instance, device) (rate(node_network_transmit_drop_total{device=~"eth.*"}[5m]) / rate(node_network_transmit_drop_total{device=~"eth.*"}[5m]))
          # 当前重传报文率 与 30 分钟前对比,涨幅百分比
          - record: node:tcp:retrans_rate5m
            expr: (irate(node_netstat_Tcp_RetransSegs[1m]) / irate(node_netstat_Tcp_OutSegs[1m])) - (irate(node_netstat_Tcp_RetransSegs[1m] offset 30m) / irate(node_netstat_Tcp_OutSegs[1m] offset 30m))
          # 当前重置报文率 与 30 分钟前对比,涨幅百分比
          - record: node:tcp:rst_rate5m
            expr: (irate(node_netstat_Tcp_OutRsts[1m]) / irate(node_netstat_Tcp_OutSegs[1m])) - (irate(node_netstat_Tcp_OutRsts[1m] offset 30m) / irate(node_netstat_Tcp_OutSegs[1m] offset 30m))
          #################
          #  TCP Socket
          #################
          # 半连接队列 syn_backlog 溢出情况
          - record: node:socket:listen_drop
            expr: irate(node_netstat_TcpExt_ListenDrops[1m])
          # 全连接队列 accept 溢出情况
          - record: node:socket:listen_overflow
            expr: irate(node_netstat_TcpExt_ListenOverflows[1m])
          # 连接追踪表使用率
          #################
          #  conntrack table
          #################
          - record: node:net:conntrack_tb_usage
            expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit)
  node_alerts.yml: |+
    groups:
      - name: node_alerts
        rules:
        ###### CPU ######
        - alert: HostHighCpuLoad
          # 最近 1m CPU 使用率超过 80%
          expr: node:cpu:cpu_usage > 0.8
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} 节点 CPU 使用率过高"
            description: "最近一分钟内 {{ $labels.instance }} 节点 CPU 使用率超过 80%!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
            console: "URL: http://baidu.com"
        - alert: HostHighCpuCoreLoad
          # 最近 1m CPU 某个核心使用率超过 80%
          expr: node:cpu:per_cpu_usage > 0.8
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} 节点 CPU 核心使用率过高"
            description: "最近一分钟内 {{ $labels.instance }} 节点 CPU 核心 {{ $labels.cpu }} 使用率超过 80%!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
            console: "URL: http://baidu.com"
        ###### Memory ######
        - alert: HostHighTmpfsUsed
          # tmpfs 内存使用超过 1 GiB
          expr: node:mem:tmpfs_used > 200
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} 节点 tmpfs 使用率过高 !"
            description: "最近一分钟内 {{ $labels.instance }} 节点 tmpfs 使用率过高 !\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostHighMemorySlabUnreclaimUsed
          # slab 不可回收内存量内存量过高
          expr: node:mem:slab_sunreclaim > 1024
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} slab 不可回收内存量内存量过高 "
            description: "最近一分钟内 {{ $labels.instance }} slab 不可回收内存量内存量过高 !\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostHighMemoryLruUnreclaimUsed
          # slab 不可回收内存量内存量过高
          expr: node:mem:lru_unevictable > 2048
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} lru list 不可回收内存量内存量过高"
            description: "最近一分钟内 {{ $labels.instance }} lru list 不可回收内存量内存量过高 !\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        ###### Disk ######
        - alert: HostOutOfDiskSpace
          # 磁盘空间使用率超过 90%
          expr: node:disk:disk_space_usage > 0.9
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "最近一分钟内 {{ $labels.instance }} 节点 CPU 使用率超过 80%"
            description: "最近一分钟内 {{ $labels.instance }} 节点 CPU 使用率超过 80%!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostDiskWillFillIn24Hour
          # 通过predict_linear函数根据过去1h的数据,推测4小时后磁盘是否会满
          expr: predict_linear(node_filesystem_free_bytes[1h], 24*3600) < 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "预计实例 {{ $labels.instance }} 挂载点将在一天后打满!"
        - alert: HostOutofDiskInodes
          expr: node:disk:inode_space_usage > 0.8
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 磁盘 inode 超过 80%"
            description: "节点 {{ $labels.instance }} 磁盘 inode 超过 80%!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostInodesWillFillIn24Hour
          # 通过predict_linear函数根据过去1h的数据,推测4小时后磁盘 inode是否会满
          expr: predict_linear(node_filesystem_files_free[1h], 24*3600) < 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "预计实例 {{ $labels.instance }} 磁盘 inode 将在一天后打满!"
        ###### DiskIO ######
        - alert: HostUnusualDiskReadLatency
          expr: node:disk:read_await > 100
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 磁盘 读请求耗时(r_await)异常"
            description: "节点 {{ $labels.instance }} 磁盘 读请求耗时(r_await)异常!\n当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostUnusualDiskWriteLatency
          expr: node:disk:write_await > 100
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 磁盘 写请求耗时(w_await)异常"
            description: "节点 {{ $labels.instance }} 磁盘 写请求耗时(w_await)异常!\n当前值:{{ $value }}\n LABELS = {{ $labels }}"
        ###### File Descriptor ######
        - alert: HostHighSystemFdUsed
          expr: node:proc:os_fd_usage > 0.8
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 系统文件描述符使用率超过 80%"
            description: "节点 {{ $labels.instance }} 系统文件描述符使用率 80%!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        ###### File Descriptor ######
        - alert: HostHighSystemFdUsed
          expr: node:proc:proc_fd_usage > 0.8
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 进程文件描述符使用率超过 80%"
            description: "节点 {{ $labels.instance }} 进程文件描述符使用率 80%!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        ###### TCP ######
        - alert: HostNetworkReceiveErrRate
          expr: node:tcp:rx_error_rate5m > 0.01
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 接收报文错误占比异常"
            description: "节点 {{ $labels.instance }} 接收报文错误占比异常!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostNetworkTransmitErrRate
          expr: node:tcp:tx_error_rate5m > 0.01
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 发送报文错误占比异常"
            description: "节点 {{ $labels.instance }} 发送报文错误占比异常!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostNetworkReceiveDropRate
          expr: node:tcp:rx_drop_rate5m > 0.01
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 接收报文丢弃占比异常"
            description: "节点 {{ $labels.instance }} 接收报文丢弃占比异常!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostNetworkTransmitDropRate
          expr: node:tcp:rx_drop_rate5m > 0.01
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 发送报文丢弃占比异常"
            description: "节点 {{ $labels.instance }} 发送报文丢弃占比异常!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostUnusualNetworkRetransRate
          expr: node:tcp:retrans_rate5m > 20
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 报文重传率发生异常升高"
            description: "节点 {{ $labels.instance }} 报文重传率发生异常升高!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostUnusualNetworkResetRate
          expr: node:tcp:rst_rate5m > 20
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 报文重置率发生异常升高"
            description: "节点 {{ $labels.instance }} 报文重置率发生异常升高!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        ###### TCP Socket ######
        - alert: HostSynBacklogOverflow
          expr: node:socket:listen_overflow > 10
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 半连接队列存在溢出现象"
            description: "节点 {{ $labels.instance }} 半连接队列存在溢出现象!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostAcceptBacklogverflow
          expr: node:socket:listen_overflow > 10
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 半连接队列存在溢出现象"
            description: "节点 {{ $labels.instance }} 半连接队列存在溢出现象!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"
        - alert: HostHighConntrackTableUsage
          expr: node:net:conntrack_tb_usage > 80
          for: 1m
          labels:
            security: warning
          annotations:
            summary: "节点 {{ $labels.instance }} 连接追踪表使用率过高"
            description: "节点 {{ $labels.instance }} 连接追踪表使用率过高!\n 当前值:{{ $value }}\n LABELS = {{ $labels }}"

vmalert 工作负载定义

apiVersion: v1
kind: Service
metadata:
  name: vmalert
  namespace: kube-vm
  labels:
    app: vmalert
spec:
  type: NodePort
  selector:
    app: vmalert
  ports:
    - name: vmalert
      port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmalert
  namespace: kube-vm
  labels:
    app: vmalert
spec:
  selector:
    matchLabels:
      app: vmalert
  template:
    metadata:
      labels:
        app: vmalert
    spec:
      volumes:
        - name: rulus
          configMap:
            name: configmap-vmalert-rules
      containers:
        - name: vmalert
          image: victoriametrics/vmalert:v1.77.0
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /etc/rules/
              name: rulus
              readOnly: true
          args:
            - -rule=/etc/rules/*.yml
            - -datasource.url=http://vmselect.kube-vm.svc.cluster.local:8481/select/0/prometheus
            - -notifier.url=http://alertmanager.kube-vm.svc.cluster.local:9193
            - -remoteWrite.url=http://vminsert.kube-vm.svc.cluster.local:8480/insert/0/prometheus
            - -evaluationInterval=15s
            - -httpListenAddr=0.0.0.0:8080

创建资源

$ kc apply -f configmap-vmalert-rules.yml
configmap/configmap-vmalert-rules created

观察状态

$ kc get deploy -n kube-vm -l app=vmalert 
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
vmalert   1/1     1    1    10s	

$ kc -n kube-vm logs -f -l app=vmalert   
2023-04-14T03:53:27.008Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "evaluationInterval"="15s"
2023-04-14T03:53:27.008Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "httpListenAddr"="0.0.0.0:8080"
2023-04-14T03:53:27.008Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "notifier.url"="http://alertmanager.kube-vm.svc.cluster.local:9093"
2023-04-14T03:53:27.008Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "remoteWrite.url"="http://vminsert.kube-vm.svc.cluster.local:8480/insert/0/prometheus"
2023-04-14T03:53:27.008Z	info	VictoriaMetrics/lib/logger/flag.go:20	flag "rule"="/etc/rules/*.yml"
2023-04-14T03:53:27.009Z	info	VictoriaMetrics/app/vmalert/main.go:131	reading rules configuration file from "/etc/rules/*.yml"
2023-04-14T03:53:27.022Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:92	starting http server at http://0.0.0.0:8080/
2023-04-14T03:53:27.022Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:93	pprof handlers are exposed at http://0.0.0.0:8080/debug/pprof/
2023-04-14T03:53:36.123Z	info	VictoriaMetrics/app/vmalert/group.go:262	group "node_alerts" started; interval=15s; concurrency=1
2023-04-14T03:53:39.681Z	info	VictoriaMetrics/app/vmalert/group.go:262	group "node_rules" started; interval=15s; concurrency=1

访问 WebUI,查询持久化指标

grafana

Grafana 使用的也是 LocalPath ,所以这里提前在 k8s-worker03 创建目录

$ mkdir -p /data/k8s/grafana

资源定义

---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-vm
  labels:
    app: grafana
spec:
  type: NodePort
  ports:
    - port: 3000
      nodePort: 30001
  selector:
    app: grafana
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-vm
  labels:
    app: grafana
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-data
      initContainers:
        - name: fix-permissions
          image: busybox
          command: [chown, -R, "472:472", "/var/lib/grafana"]
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: storage
      containers:
        - name: grafana
          image: grafana/grafana:9.4.7
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
              name: grafana
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: LotusChing
          readinessProbe:
            failureThreshold: 10
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 30
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 150m
              memory: 512Mi
            requests:
              cpu: 150m
              memory: 512Mi
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: storage

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-local
  namespace: kube-vm
  labels:
    app: grafana
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  storageClassName: local-storage
  local:
    # 需要提前创建该目录
    path: /data/k8s/grafana
  persistentVolumeReclaimPolicy: Retain
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                # 亲和性选择节点 k8s-worker03
                - k8s-worker03
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: kube-vm
  labels:
    app: grafana
spec:
  selector:
    matchLabels:
      app: grafana
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-storage

创建资源,并观察状态

$ kc apply -f grafana.yml    
service/grafana created
deployment.apps/grafana created
persistentvolume/grafana-local created
persistentvolumeclaim/grafana-data created

$ kc get deploy -n kube-vm
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
grafana   1/1     1    1    99s

检查确认:监控大盘数据

等到 grafana 工作负载 Ready,访问 WebUI,创建数据源,这里数据源配置使用 vmselect

导入仪表盘 18435,检查监控数据是否正常查询、渲染、展示

检查确认:告警流程

通过 dd 命令,快速创建文件,触发告警,测试整个告警通知流程能否正常跑通

$ dd if=/dev/urandom of=testfile bs=1M count=300
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 1.24061 s, 254 MB/s

访问 vmalert WebUI 等待活跃告警

打开钉钉,等待消息通知

资源汇总

贴一下最终 Kubernetes 资源汇总

ConfigMap

$ kc get cm -n kube-vm

NAME                            DATA   AGE
configmap-alertmanager-config   1      120m
configmap-vmagent-config        1      143m
configmap-vmalert-rules         2      100m
kube-root-ca.crt                1      154m

Secret

$ kc get secret -n kube-vm

NAME                     TYPE                                  DATA   AGE
default-token-ltvtd      kubernetes.io/service-account-token   3      154m
secret-promoter-config   Opaque                                1      114m
vmagent-sa-token-jkrnp   kubernetes.io/service-account-token   3      153m

StorageClass

$ kc get sc -n kube-vm

NAME            PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  153m、

PV

$ kc get pv -n kube-vm

NAME               CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                       STORAGECLASS    REASON   AGE
grafana-local      1Gi        RWO            Retain           Bound    kube-vm/grafana-data        local-storage            152m
vmagent-pv         2Gi        RWX            Retain           Bound    kube-vm/vmagent-pvc         local-storage            136m
vmstore-local-pv   20Gi       RWX            Retain           Bound    kube-vm/vmstore-local-pvc   local-storage            151m

PVC

$ kc get pvc -n kube-vm

grafana-data        Bound    grafana-local      1Gi        RWO            local-storage   152m
vmagent-pvc         Bound    vmagent-pv         2Gi        RWX            local-storage   136m
vmstore-local-pvc   Bound    vmstore-local-pv   20Gi       RWX            local-storage   152m

DaemonSet

$ kc get ds -n kube-vm

NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   3         3         3       3            3           kubernetes.io/os=linux   154m

Deployment

$ kc get deploy -n kube-vm

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
alertmanager   1/1     1            1           121m
grafana        1/1     1            1           152m
promoter       1/1     1            1           114m
vmalert        1/1     1            1           100m
vminsert       1/1     1            1           145m
vmselect       1/1     1            1           149m

StatefulSet

$ kc get sts -n kube-vm

NAME        READY   AGE
vmagent     2/2     136m
vmstorage   2/2     152m

Service

$ kc get svc -n kube-vm -o wide 

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE    SELECTOR
alertmanager        ClusterIP   10.108.40.17     <none>        9193/TCP                     121m   app=alertmanager
cluster-vmstorage   ClusterIP   None             <none>        8482/TCP,8401/TCP,8400/TCP   152m   app=vmstorage
grafana             NodePort    10.109.99.238    <none>        3000:30001/TCP               153m   app=grafana
promoter            ClusterIP   10.109.5.115     <none>        9194/TCP                     114m   app=promoter
vmagent             ClusterIP   None             <none>        8429/TCP                     136m   app=vmagent
vmalert             NodePort    10.105.4.29      <none>        8080:31303/TCP               100m   app=vmalert
vminsert            ClusterIP   10.97.67.96      <none>        8480/TCP                     145m   app=vminsert
vmselect            NodePort    10.103.106.255   <none>        8481:32515/TCP               149m   app=vmselect

镜像

$ ctr --namespace k8s.io images ls -q|grep -v 'sha256'

docker.io/grafana/grafana:9.4.7
docker.io/lotusching/promoter:latest
docker.io/prom/alertmanager:v0.25.0
docker.io/prom/node-exporter:v1.5.0
docker.io/prom/prometheus:v2.35.0
docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.0
docker.io/rancher/mirrored-flannelcni-flannel:v0.20.1
docker.io/victoriametrics/vmagent:v1.77.0
docker.io/victoriametrics/vmalert:v1.77.0
docker.io/victoriametrics/vminsert:v1.77.0-cluster
docker.io/victoriametrics/vmselect:v1.77.0-cluster
docker.io/victoriametrics/vmstorage:v1.77.0-cluster
registry.aliyuncs.com/google_containers/coredns:v1.8.4
registry.aliyuncs.com/google_containers/etcd:3.5.0-0
registry.aliyuncs.com/google_containers/kube-apiserver:v1.22.2
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.22.2
registry.aliyuncs.com/google_containers/kube-proxy:v1.22.2
registry.aliyuncs.com/google_containers/kube-scheduler:v1.22.2
registry.aliyuncs.com/google_containers/pause:3.5
registry.aliyuncs.com/google_containers/pause:3.6

故障排查

贴一下大致的排障思路

  1. 检查 Pod 状态,观察 READY、STATUS、RESTART 列

    $ kc get pods -o wide -n kube-mon
  2. 如果状态不正确,检查 Pod 状态详细描述

    $ kc -n kube-mon describe -l app=<name> 
    • 检查是否正常调度
    • 检查 PV、PVC 是否正确挂载
  3. 如果 Pod 是 PV、PVC 相关问题

    • 检查 pv、pvc 是否正确关联绑定,关注 NAMECLAIM

      NAME                             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                   STORAGECLASS    REASON   AGE
      persistentvolume/grafana-local   1Gi        RWO            Retain           Bound    kube-mon/grafana-data   local-storage            3h20m
  4. 仪表盘 Sytemd 服务单元状态无数据,这是因为 node_exporter 运行在容器中,暂时没去处理,等后续补上处理方式

集群备份

victoria-metrics 提供了与备份相关的两个二进制程序

  • vmbackup:负责从快照从生成备份数据,如果目标目录已有备份,则自动使用增量方式备份
  • vmrestore:负责从备份数据还原指标数据

victoria-metrics 备份操作过程主要就是两步

  • 通过 http api 创建快照
  • 通过 二进制程序生成备份数据

过程比较简单,如下所示

1. 创建快照

victoria-metrics 提供了 http api,这里需要先获取各 vmstorage pod 的 ip

获取 vmstorage pod 的 IP

$ kc get pods -o wide -l app=vmstorage -n kube-vm
NAME          READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
vmstorage-0   1/1     Running   0          7m39s   10.244.1.9    k8s-worker02   <none>           <none>
vmstorage-1   1/1     Running   0          7m24s   10.244.1.10   k8s-worker02   <none>           <none>

服务监听端口配置到了 8482,所以这里直接通过 curl 命令创建快照

$ curl http://10.244.1.3:8482/snapshot/create
{"status":"ok","snapshot":"20230414055930-1755ADF679466C8F"}

观察日志

$ kc -n kube-vm logs -f vmstorage-0

# ...
{"ts":"2023-04-14T05:59:30.170Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:330","msg":"creating Storage snapshot for \"/storage/vmstorage-0\"..."}
{"ts":"2023-04-14T05:59:30.176Z","level":"info","caller":"VictoriaMetrics/lib/storage/table.go:145","msg":"creating table snapshot of \"/storage/vmstorage-0/data\"..."}
{"ts":"2023-04-14T05:59:30.181Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1604","msg":"creating partition snapshot of \"/storage/vmstorage-0/data/small/2023_04\" and \"/storage/vmstorage-0/data/big/2023_04\"..."}
{"ts":"2023-04-14T05:59:30.340Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1625","msg":"created partition snapshot of \"/storage/vmstorage-0/data/small/2023_04\" and \"/storage/vmstorage-0/data/big/2023_04\" at \"/storage/vmstorage-0/data/small/snapshots/20230414055930-1755ADF679466C8F/2023_04\" and \"/storage/vmstorage-0/data/big/snapshots/20230414055930-1755ADF679466C8F/2023_04\" in 0.159 seconds"}
{"ts":"2023-04-14T05:59:30.340Z","level":"info","caller":"VictoriaMetrics/lib/storage/table.go:173","msg":"created table snapshot for \"/storage/vmstorage-0/data\" at (\"/storage/vmstorage-0/data/small/snapshots/20230414055930-1755ADF679466C8F\", \"/storage/vmstorage-0/data/big/snapshots/20230414055930-1755ADF679466C8F\") in 0.163 seconds"}
{"ts":"2023-04-14T05:59:30.343Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:1146","msg":"creating Table snapshot of \"/storage/vmstorage-0/indexdb/1755ADF679474FBF\"..."}
{"ts":"2023-04-14T05:59:30.386Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:1215","msg":"created Table snapshot of \"/storage/vmstorage-0/indexdb/1755ADF679474FBF\" at \"/storage/vmstorage-0/indexdb/snapshots/20230414055930-1755ADF679466C8F/1755ADF679474FBF\" in 0.043 seconds"}
{"ts":"2023-04-14T05:59:30.386Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:1146","msg":"creating Table snapshot of \"/storage/vmstorage-0/indexdb/1755ADF679474FBE\"..."}
{"ts":"2023-04-14T05:59:30.393Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:1215","msg":"created Table snapshot of \"/storage/vmstorage-0/indexdb/1755ADF679474FBE\" at \"/storage/vmstorage-0/indexdb/snapshots/20230414055930-1755ADF679466C8F/1755ADF679474FBE\" in 0.006 seconds"}
{"ts":"2023-04-14T05:59:30.409Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:387","msg":"created Storage snapshot for \"/storage/vmstorage-0\" at \"/storage/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F\" in 0.239 seconds"}

2. 列出快照

$ curl http://10.244.1.3:8482/snapshot/list  
{"status":"ok","snapshots":[
"20230414055930-1755ADF679466C8F"
]}

查看数据目录

$ ls vmstorage-0/snapshots 
20230414055930-1755ADF679466C8F
$ tree vmstorage-0/snapshots 
vmstorage-0/snapshots
└── 20230414055930-1755ADF679466C8F
    ├── data
    │   ├── big -> ../../../data/big/snapshots/20230414055930-1755ADF679466C8F
    │   └── small -> ../../../data/small/snapshots/20230414055930-1755ADF679466C8F
    ├── indexdb -> ../../indexdb/snapshots/20230414055930-1755ADF679466C8F
    └── metadata
    └── minTimestampForCompositeIndex

6 directories, 1 file

3. 全量备份

获取 vmbackup、vmrestore 命令

$ wget http://download.yo-yo.fun/prometheus/vmbackup
$ wget http://download.yo-yo.fun/prometheus/vmrestore
$ chmod +x vmbackup; chmod +x vmrestore
$ mv vmbackup /usr/bin/; mv vmrestore /usr/bin/

创建备份目录

$ mkdir /data/backup

执行全备

$ vmbackup -storageDataPath=/data/k8s/vmstore/vmstorage-0 -snapshotName=20230414055930-1755ADF679466C8F -dst=fs:///data/backup/vmstorage-0/

输出如下

# 输出命令行参数
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/logger/flag.go:12	build version: vmbackup-20230407-010908-tags-v1.90.0-0-gb5d18c0d2
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/logger/flag.go:13	command-line flags
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -dst="fs:///data/backup/vmstorage-0/"
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -snapshotName="20230414055930-1755ADF679466C8F"
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -storageDataPath="/data/k8s/vmstore/vmstorage-0"
# 开始执行备份
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/backup/actions/backup.go:78	starting backup from fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to fsremote "/data/backup/vmstorage-0/" using origin fsnil
# 这里似乎是起了 web server
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:96	starting http server at http://127.0.0.1:8420/
2023-04-14T06:10:48.361Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:97	pprof handlers are exposed at http://127.0.0.1:8420/debug/pprof/
# 发现 128 parts
2023-04-14T06:10:48.363Z	info	VictoriaMetrics/lib/backup/actions/backup.go:84	obtained 128 parts from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F"
2023-04-14T06:10:48.363Z	info	VictoriaMetrics/lib/backup/actions/backup.go:90	obtained 0 parts from dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.363Z	info	VictoriaMetrics/lib/backup/actions/backup.go:96	obtained 0 parts from origin fsnil
# 上传 parts 到 fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.365Z	info	VictoriaMetrics/lib/backup/actions/backup.go:149	uploading 128 parts from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.365Z	info	VictoriaMetrics/lib/backup/actions/backup.go:152	uploading part{path: "data/small/2023_04/6830_4703_20230414055700.000_20230414055852.749_1755AF19C405E6B3/index.bin", file_size: 84346, offset: 0, size: 84346} from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.365Z	info	VictoriaMetrics/lib/backup/actions/backup.go:152	uploading part{path: "data/small/2023_04/6830_4703_20230414055700.000_20230414055852.749_1755AF19C405E6B3/min_dedup_interval", file_size: 4, offset: 0, size: 4} from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.365Z	info	VictoriaMetrics/lib/memory/memory.go:42	limiting caches to 5010650726 bytes, leaving 3340433818 bytes to the OS according to -memory.allowedPercent=60
2023-04-14T06:10:48.365Z	info	VictoriaMetrics/lib/backup/actions/backup.go:152	uploading part{path: "data/small/2023_04/6830_4703_20230414055700.000_20230414055852.749_1755AF19C405E6B3/timestamps.bin", file_size: 2922, offset: 0, size: 2922} from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.366Z	info	VictoriaMetrics/lib/backup/actions/backup.go:152	uploading part{path: "data/small/2023_04/6830_4703_20230414055700.000_20230414055852.749_1755AF19C405E6B3/values.bin", file_size: 1147, offset: 0, size: 1147} from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.366Z	info	VictoriaMetrics/lib/backup/actions/backup.go:152	uploading part{path: "data/small/2023_04/827_827_20230414055830.000_20230414055855.456_1755AF19C405E6B5/min_dedup_interval", file_size: 4, offset: 0, size: 4} from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
2023-04-14T06:10:48.372Z	info	VictoriaMetrics/lib/backup/actions/backup.go:152	uploading part{path: "data/small/2023_04/475572_13378_20230414035345.000_20230414042918.994_1755AF19C405E240/min_dedup_interval", file_size: 4, offset: 0, size: 4} from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/"
# 省略...
# 上传完成耗时 167.485473ms
2023-04-14T06:10:48.532Z	info	VictoriaMetrics/lib/backup/actions/backup.go:170	uploaded 5987563 out of 5987563 bytes from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/" in 167.485473ms
2023-04-14T06:10:48.533Z	info	VictoriaMetrics/lib/backup/actions/backup.go:179	backup from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414055930-1755ADF679466C8F" to dst fsremote "/data/backup/vmstorage-0/" with origin fsnil is complete; backed up 5987563 bytes in 0.172 seconds; deleted 0 bytes; server-side copied 0 bytes; uploaded 5987563 bytes
2023-04-14T06:10:48.533Z	info	VictoriaMetrics/app/vmbackup/main.go:108	gracefully shutting down http server for metrics at ":8420"
# 关闭 web server
2023-04-14T06:10:48.533Z	info	VictoriaMetrics/app/vmbackup/main.go:112	successfully shut down http server for metrics in 0.000 seconds

查看备份目录

$ ls /data/backup/vmstorage-0/
backup_complete.ignore  data  indexdb  metadata

OK,这里有了一份全量

4. 增量备份

由于数据是持续写入的,所以这段时间肯定也会产生数据,这里再进行一次增量备份

再创建一份创建快照

$ vmstorage-0  curl http://10.244.1.3:8482/snapshot/create
{"status":"ok","snapshot":"20230414061750-1755ADF679466C90"}#    
$ vmstorage-0  curl http://10.244.1.3:8482/snapshot/list  
{"status":"ok","snapshots":[
"20230414055930-1755ADF679466C8F",
"20230414061750-1755ADF679466C90"
]}#    

执行增量备份

$ vmbackup -storageDataPath=/data/k8s/vmstore/vmstorage-0 -snapshotName=20230414061750-1755ADF679466C90 -dst=fs:///data/backup/vmstorage-0/

# ...
2023-04-14T06:19:17.902Z	info	VictoriaMetrics/lib/backup/actions/backup.go:170	uploaded 1514709 out of 1514709 bytes from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414061750-1755ADF679466C90" to dst fsremote "/data/backup/vmstorage-0/" in 85.711256ms
2023-04-14T06:19:17.902Z	info	VictoriaMetrics/lib/backup/actions/backup.go:179	backup from src fslocal "/data/k8s/vmstore/vmstorage-0/snapshots/20230414061750-1755ADF679466C90" to dst fsremote "/data/backup/vmstorage-0/" with origin fsnil is complete; backed up 6576917 bytes in 0.106 seconds; deleted 925355 bytes; server-side copied 0 bytes; uploaded 1514709 bytes
2023-04-14T06:19:17.902Z	info	VictoriaMetrics/app/vmbackup/main.go:108	gracefully shutting down http server for metrics at ":8420"
2023-04-14T06:19:17.902Z	info	VictoriaMetrics/app/vmbackup/main.go:112	successfully shut down http server for metrics in 0.000 seconds

因为,我们之前全量备份过一次,所以这一次增量备份执行的很快

同理,给另一个实例 vmstorage-1 也备份一下

$ curl http://10.244.1.4:8482/snapshot/create
{"status":"ok","snapshot":"20230414062614-1755ADF9F686DD3A"}#                                                                                       
$ vmbackup -storageDataPath=/data/k8s/vmstore/vmstorage-1 -snapshotName=20230414062614-1755ADF9F686DD3A -dst=fs:///data/backup/vmstorage-1/

5. 模拟故障

删除实例、删除数据,模拟数据丢失场景

$ rm -rf /data/k8s/vmstore/vmstorage-0/*
$ rm -rf /data/k8s/vmstore/vmstorage-1/*
$ tree
.
├── vmstorage-0
└── vmstorage-1

6. 数据恢复

执行 vmrestore 恢复 vmstorage-0 数据

$ vmrestore -src=fs:///data/backup/vmstorage-0 -storageDataPath=/data/k8s/vmstore/vmstorage-0

输出如下


2023-04-14T06:32:50.454Z	info	VictoriaMetrics/lib/logger/flag.go:12	build version: vmrestore-20230407-011039-tags-v1.90.0-0-gb5d18c0d2
2023-04-14T06:32:50.454Z	info	VictoriaMetrics/lib/logger/flag.go:13	command-line flags
2023-04-14T06:32:50.454Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -src="fs:///data/backup/vmstorage-0"
2023-04-14T06:32:50.454Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -storageDataPath="/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.454Z	info	VictoriaMetrics/lib/backup/actions/restore.go:75	starting restore from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.454Z	info	VictoriaMetrics/lib/backup/actions/restore.go:77	obtaining list of parts at fsremote "/data/backup/vmstorage-0"
2023-04-14T06:32:50.455Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:96	starting http server at http://127.0.0.1:8421/
2023-04-14T06:32:50.455Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:97	pprof handlers are exposed at http://127.0.0.1:8421/debug/pprof/
2023-04-14T06:32:50.462Z	info	VictoriaMetrics/lib/backup/actions/restore.go:82	obtaining list of parts at fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.462Z	info	VictoriaMetrics/lib/backup/actions/restore.go:162	downloading 118 parts from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.462Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "data/small/2023_04/7673_4708_20230414061643.107_20230414061737.749_1755AF19C405E79E/min_dedup_interval", file_size: 4, offset: 0, size: 4} from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.462Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "data/small/2023_04/475572_13378_20230414035345.000_20230414042918.994_1755AF19C405E240/min_dedup_interval", file_size: 4, offset: 0, size: 4} from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.463Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "indexdb/1755ADF679474FBF/87659_221_1755ADF67C1883C0/metaindex.bin", file_size: 389, offset: 0, size: 389} from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.463Z	info	VictoriaMetrics/lib/memory/memory.go:42	limiting caches to 5010650726 bytes, leaving 3340433818 bytes to the OS according to -memory.allowedPercent=60
2023-04-14T06:32:50.463Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "data/small/2023_04/7673_4708_20230414061643.107_20230414061737.749_1755AF19C405E79E/metaindex.bin", file_size: 291, offset: 0, size: 291} from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
# ...
2023-04-14T06:32:50.640Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "indexdb/1755ADF679474FBF/98_1_1755ADF67C188460/items.bin", file_size: 1086, offset: 0, size: 1086} from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0"
2023-04-14T06:32:50.646Z	info	VictoriaMetrics/lib/backup/actions/restore.go:188	downloaded 6576917 out of 6576917 bytes from fsremote "/data/backup/vmstorage-0" to fslocal "/data/k8s/vmstore/vmstorage-0" in 184.189806ms
2023-04-14T06:32:50.646Z	info	VictoriaMetrics/lib/backup/actions/restore.go:195	restored 6576917 bytes from backup in 0.192 seconds; deleted 0 bytes; downloaded 6576917 bytes
2023-04-14T06:32:50.647Z	info	VictoriaMetrics/app/vmrestore/main.go:64	gracefully shutting down http server for metrics at ":8421"
2023-04-14T06:32:50.647Z	info	VictoriaMetrics/app/vmrestore/main.go:68	successfully shut down http server for metrics in 0.000 seconds

执行 vmrestore 恢复 vmstorage-1 数据

$ vmrestore -src=fs:///data/backup/vmstorage-1 -storageDataPath=/data/k8s/vmstore/vmstorage-1

输出如下

2023-04-14T06:33:26.875Z	info	VictoriaMetrics/lib/logger/flag.go:12	build version: vmrestore-20230407-011039-tags-v1.90.0-0-gb5d18c0d2
2023-04-14T06:33:26.875Z	info	VictoriaMetrics/lib/logger/flag.go:13	command-line flags
2023-04-14T06:33:26.875Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -src="fs:///data/backup/vmstorage-1"
2023-04-14T06:33:26.875Z	info	VictoriaMetrics/lib/logger/flag.go:20	  -storageDataPath="/data/k8s/vmstore/vmstorage-1"
2023-04-14T06:33:26.875Z	info	VictoriaMetrics/lib/backup/actions/restore.go:75	starting restore from fsremote "/data/backup/vmstorage-1" to fslocal "/data/k8s/vmstore/vmstorage-1"
2023-04-14T06:33:26.875Z	info	VictoriaMetrics/lib/backup/actions/restore.go:77	obtaining list of parts at fsremote "/data/backup/vmstorage-1"
2023-04-14T06:33:26.876Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:96	starting http server at http://127.0.0.1:8421/
2023-04-14T06:33:26.876Z	info	VictoriaMetrics/lib/httpserver/httpserver.go:97	pprof handlers are exposed at http://127.0.0.1:8421/debug/pprof/
2023-04-14T06:33:26.880Z	info	VictoriaMetrics/lib/backup/actions/restore.go:82	obtaining list of parts at fslocal "/data/k8s/vmstore/vmstorage-1"
2023-04-14T06:33:26.881Z	info	VictoriaMetrics/lib/backup/actions/restore.go:162	downloading 78 parts from fsremote "/data/backup/vmstorage-1" to fslocal "/data/k8s/vmstore/vmstorage-1"
2023-04-14T06:33:26.881Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "data/small/2023_04/56_56_20230414035700.000_20230414035700.000_1755AF19C405A09D/min_dedup_interval", file_size: 4, offset: 0, size: 4} from fsremote "/data/backup/vmstorage-1" to fslocal "/data/k8s/vmstore/vmstorage-1"
2023-04-14T06:33:26.881Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "indexdb/1755ADF9F68785C1/38_1_1755ADF9F9DC8FC1/metaindex.bin", file_size: 271, offset: 0, size: 271} from fsremote "/data/backup/vmstorage-1" to fslocal "/data/k8s/vmstore/vmstorage-1"
# ...
2023-04-14T06:33:26.998Z	info	VictoriaMetrics/lib/backup/actions/restore.go:169	downloading part{path: "data/small/2023_04/56_56_20230414040200.000_20230414040212.000_1755AF19C405A0DE/values.bin", file_size: 0, offset: 0, size: 0} from fsremote "/data/backup/vmstorage-1" to fslocal "/data/k8s/vmstore/vmstorage-1"
2023-04-14T06:33:27.006Z	info	VictoriaMetrics/lib/backup/actions/restore.go:188	downloaded 4656394 out of 4656394 bytes from fsremote "/data/backup/vmstorage-1" to fslocal "/data/k8s/vmstore/vmstorage-1" in 124.902351ms
2023-04-14T06:33:27.006Z	info	VictoriaMetrics/lib/backup/actions/restore.go:195	restored 4656394 bytes from backup in 0.131 seconds; deleted 0 bytes; downloaded 4656394 bytes
2023-04-14T06:33:27.006Z	info	VictoriaMetrics/app/vmrestore/main.go:64	gracefully shutting down http server for metrics at ":8421"
2023-04-14T06:33:27.006Z	info	VictoriaMetrics/app/vmrestore/main.go:68	successfully shut down http server for metrics in 0.000 seconds

7. 实例恢复

恢复完数据后,恢复创建 sts vmstorage 实例

$ kc apply -f vmstore.yml    
service/cluster-vmstorage configured
persistentvolume/vmstore-local-pv unchanged
persistentvolumeclaim/vmstore-local-pvc unchanged
statefulset.apps/vmstorage created

$ kc get sts -n kube-vm -l app=vmstorage 
NAME    READY   AGE
vmstorage   2/2     33s

查看启动日志,可以看到有在 loading 数据

$ kc -n kube-vm logs -f vmstorage-0     
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:12","msg":"build version: vmstorage-20220505-083109-tags-v1.77.0-cluster-0-g2ce1d0913"}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:13","msg":"command line flags"}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"dedup.minScrapeInterval\"=\"15s\""}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.enable\"=\"true\""}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.prefix\"=\"VM_\""}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"loggerFormat\"=\"json\""}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"retentionPeriod\"=\"1\""}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"storageDataPath\"=\"/storage/vmstorage-0\""}
{"ts":"2023-04-14T06:34:07.680Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:77","msg":"opening storage at \"/storage/vmstorage-0\" with -retentionPeriod=1"}
{"ts":"2023-04-14T06:34:07.684Z","level":"info","caller":"VictoriaMetrics/lib/memory/memory.go:42","msg":"limiting caches to 5010650726 bytes, leaving 3340433818 bytes to the OS according to -memory.allowedPercent=60"}
{"ts":"2023-04-14T06:34:07.684Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1072","msg":"loading MetricName->TSID cache from \"/storage/vmstorage-0/cache/metricName_tsid\"..."}
{"ts":"2023-04-14T06:34:07.689Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1077","msg":"loaded MetricName->TSID cache from \"/storage/vmstorage-0/cache/metricName_tsid\" in 0.005 seconds; entriesCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:07.689Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1072","msg":"loading MetricID->TSID cache from \"/storage/vmstorage-0/cache/metricID_tsid\"..."}
{"ts":"2023-04-14T06:34:07.690Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1077","msg":"loaded MetricID->TSID cache from \"/storage/vmstorage-0/cache/metricID_tsid\" in 0.001 seconds; entriesCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:07.690Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1072","msg":"loading MetricID->MetricName cache from \"/storage/vmstorage-0/cache/metricID_metricName\"..."}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1077","msg":"loaded MetricID->MetricName cache from \"/storage/vmstorage-0/cache/metricID_metricName\" in 0.002 seconds; entriesCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:873","msg":"loading curr_hour_metric_ids from \"/storage/vmstorage-0/cache/curr_hour_metric_ids\"..."}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:876","msg":"nothing to load from \"/storage/vmstorage-0/cache/curr_hour_metric_ids\""}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:873","msg":"loading prev_hour_metric_ids from \"/storage/vmstorage-0/cache/prev_hour_metric_ids\"..."}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:876","msg":"nothing to load from \"/storage/vmstorage-0/cache/prev_hour_metric_ids\""}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:829","msg":"loading next_day_metric_ids from \"/storage/vmstorage-0/cache/next_day_metric_ids\"..."}
{"ts":"2023-04-14T06:34:07.692Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:832","msg":"nothing to load from \"/storage/vmstorage-0/cache/next_day_metric_ids\""}
{"ts":"2023-04-14T06:34:07.699Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-0/indexdb/1755ADF679474FBF\"..."}
{"ts":"2023-04-14T06:34:07.713Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-0/indexdb/1755ADF679474FBF\" has been opened in 0.014 seconds; partsCount: 6; blocksCount: 232, itemsCount: 92512; sizeBytes: 2409104"}
{"ts":"2023-04-14T06:34:07.714Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-0/indexdb/1755ADF679474FBE\"..."}
{"ts":"2023-04-14T06:34:07.723Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-0/indexdb/1755ADF679474FBE\" has been opened in 0.009 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:07.764Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/10368_10368_20230414061700.970_20230414061720.220_1755AF19C405E79B\" in 0.003 seconds"}
{"ts":"2023-04-14T06:34:07.765Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/475572_13378_20230414035345.000_20230414042918.994_1755AF19C405E240\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.766Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/8678_8678_20230414061732.456_20230414061733.994_1755AF19C405E79F\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.767Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/10754_10754_20230414061728.341_20230414061749.040_1755AF19C405E7A1\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.768Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/830_830_20230414061715.000_20230414061740.456_1755AF19C405E7A0\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.769Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/122524_13384_20230414055700.000_20230414060618.994_1755AF19C405E711\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.770Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/345014_13296_20230414042844.588_20230414045420.220_1755AF19C405E37F\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.770Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/425926_13230_20230414032236.647_20230414035433.994_1755AF19C405E083\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.771Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/7673_4708_20230414061643.107_20230414061737.749_1755AF19C405E79E\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.772Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/10368_10368_20230414061638.247_20230414061705.212_1755AF19C405E798\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.773Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/30566_13386_20230414061442.115_20230414061635.216_1755AF19C405E792\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.779Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/120520_13388_20230414060600.000_20230414061450.218_1755AF19C405E77D\" in 0.006 seconds"}
{"ts":"2023-04-14T06:34:07.780Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/54_54_20230414040630.000_20230414040630.000_1755AF19C405E125\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.781Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/54_54_20230414035815.000_20230414035815.000_1755AF19C405E0BC\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.781Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/436062_13322_20230414045330.000_20230414052603.994_1755AF19C405E512\" in 0.000 seconds"}
{"ts":"2023-04-14T06:34:07.782Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/10368_10368_20230414061630.980_20230414061650.220_1755AF19C405E795\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.783Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-0/data/small/2023_04/430134_13383_20230414052500.000_20230414055755.456_1755AF19C405E6A8\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:07.792Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:92","msg":"successfully opened storage \"/storage/vmstorage-0\" in 0.112 seconds; partsCount: 17; blocksCount: 162949; rowsCount: 2445465; sizeBytes: 4163294"}
{"ts":"2023-04-14T06:34:07.795Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:152","msg":"accepting vmselect conns at 0.0.0.0:8401"}
{"ts":"2023-04-14T06:34:07.795Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:85","msg":"accepting vminsert conns at 0.0.0.0:8400"}
{"ts":"2023-04-14T06:34:07.795Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:88","msg":"starting http server at http://127.0.0.1:8482/"}
{"ts":"2023-04-14T06:34:07.795Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:89","msg":"pprof handlers are exposed at http://127.0.0.1:8482/debug/pprof/"}
{"ts":"2023-04-14T06:34:51.247Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:164","msg":"accepted vmselect conn from 10.244.2.4:47268"}
{"ts":"2023-04-14T06:34:51.248Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:207","msg":"processing vmselect conn from 10.244.2.4:47268"}
{"ts":"2023-04-14T06:34:51.332Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:97","msg":"accepted vminsert conn from 10.244.1.5:55912"}
{"ts":"2023-04-14T06:34:51.334Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"processing vminsert conn from 10.244.1.5:55912"}

storage-1 的日志类似

$ kc -n kube-vm logs -f vmstorage-1
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:12","msg":"build version: vmstorage-20220505-083109-tags-v1.77.0-cluster-0-g2ce1d0913"}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:13","msg":"command line flags"}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"dedup.minScrapeInterval\"=\"15s\""}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.enable\"=\"true\""}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"envflag.prefix\"=\"VM_\""}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"loggerFormat\"=\"json\""}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"retentionPeriod\"=\"1\""}
{"ts":"2023-04-14T06:34:22.662Z","level":"info","caller":"VictoriaMetrics/lib/logger/flag.go:20","msg":"flag \"storageDataPath\"=\"/storage/vmstorage-1\""}
{"ts":"2023-04-14T06:34:22.663Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:77","msg":"opening storage at \"/storage/vmstorage-1\" with -retentionPeriod=1"}
{"ts":"2023-04-14T06:34:22.667Z","level":"info","caller":"VictoriaMetrics/lib/memory/memory.go:42","msg":"limiting caches to 5010650726 bytes, leaving 3340433818 bytes to the OS according to -memory.allowedPercent=60"}
{"ts":"2023-04-14T06:34:22.667Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1072","msg":"loading MetricName->TSID cache from \"/storage/vmstorage-1/cache/metricName_tsid\"..."}
{"ts":"2023-04-14T06:34:22.673Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1077","msg":"loaded MetricName->TSID cache from \"/storage/vmstorage-1/cache/metricName_tsid\" in 0.005 seconds; entriesCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:22.673Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1072","msg":"loading MetricID->TSID cache from \"/storage/vmstorage-1/cache/metricID_tsid\"..."}
{"ts":"2023-04-14T06:34:22.673Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1077","msg":"loaded MetricID->TSID cache from \"/storage/vmstorage-1/cache/metricID_tsid\" in 0.001 seconds; entriesCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:22.673Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1072","msg":"loading MetricID->MetricName cache from \"/storage/vmstorage-1/cache/metricID_metricName\"..."}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:1077","msg":"loaded MetricID->MetricName cache from \"/storage/vmstorage-1/cache/metricID_metricName\" in 0.001 seconds; entriesCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:873","msg":"loading curr_hour_metric_ids from \"/storage/vmstorage-1/cache/curr_hour_metric_ids\"..."}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:876","msg":"nothing to load from \"/storage/vmstorage-1/cache/curr_hour_metric_ids\""}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:873","msg":"loading prev_hour_metric_ids from \"/storage/vmstorage-1/cache/prev_hour_metric_ids\"..."}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:876","msg":"nothing to load from \"/storage/vmstorage-1/cache/prev_hour_metric_ids\""}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:829","msg":"loading next_day_metric_ids from \"/storage/vmstorage-1/cache/next_day_metric_ids\"..."}
{"ts":"2023-04-14T06:34:22.675Z","level":"info","caller":"VictoriaMetrics/lib/storage/storage.go:832","msg":"nothing to load from \"/storage/vmstorage-1/cache/next_day_metric_ids\""}
{"ts":"2023-04-14T06:34:22.681Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C1\"..."}
{"ts":"2023-04-14T06:34:22.696Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C1\" has been opened in 0.015 seconds; partsCount: 6; blocksCount: 239, itemsCount: 95977; sizeBytes: 2461935"}
{"ts":"2023-04-14T06:34:22.699Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:259","msg":"opening table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C0\"..."}
{"ts":"2023-04-14T06:34:22.708Z","level":"info","caller":"VictoriaMetrics/lib/mergeset/table.go:294","msg":"table \"/storage/vmstorage-1/indexdb/1755ADF9F68785C0\" has been opened in 0.009 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0"}
{"ts":"2023-04-14T06:34:22.745Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/56_56_20230414040330.000_20230414040330.000_1755AF19C405A0F2\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.745Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/2169_2169_20230414062545.000_20230414062613.760_1755AF19C405A806\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.747Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/31091_13609_20230414062041.348_20230414062235.210_1755AF19C405A7D9\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.748Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/2406706_13699_20230414032236.647_20230414062050.207_1755AF19C405A7C2\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.749Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/56_56_20230414040030.000_20230414040030.000_1755AF19C405A0CA\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.749Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/56_56_20230414035700.000_20230414035700.000_1755AF19C405A09D\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.750Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/56_56_20230414040200.000_20230414040212.000_1755AF19C405A0DE\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.751Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/40704_13611_20230414062229.384_20230414062420.209_1755AF19C405A7EE\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.752Z","level":"info","caller":"VictoriaMetrics/lib/storage/partition.go:1578","msg":"opened part \"/storage/vmstorage-1/data/small/2023_04/36877_13613_20230414062400.000_20230414062605.214_1755AF19C405A805\" in 0.001 seconds"}
{"ts":"2023-04-14T06:34:22.768Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/main.go:92","msg":"successfully opened storage \"/storage/vmstorage-1\" in 0.106 seconds; partsCount: 9; blocksCount: 56925; rowsCount: 2517771; sizeBytes: 2189940"}
{"ts":"2023-04-14T06:34:22.770Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:85","msg":"accepting vminsert conns at 0.0.0.0:8400"}
{"ts":"2023-04-14T06:34:22.770Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:88","msg":"starting http server at http://127.0.0.1:8482/"}
{"ts":"2023-04-14T06:34:22.770Z","level":"info","caller":"VictoriaMetrics/lib/httpserver/httpserver.go:89","msg":"pprof handlers are exposed at http://127.0.0.1:8482/debug/pprof/"}
{"ts":"2023-04-14T06:34:22.771Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:152","msg":"accepting vmselect conns at 0.0.0.0:8401"}
{"ts":"2023-04-14T06:34:51.220Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:164","msg":"accepted vmselect conn from 10.244.2.4:41638"}
{"ts":"2023-04-14T06:34:51.220Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:207","msg":"processing vmselect conn from 10.244.2.4:41638"}
{"ts":"2023-04-14T06:34:51.332Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:97","msg":"accepted vminsert conn from 10.244.1.5:40532"}
{"ts":"2023-04-14T06:34:51.333Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"processing vminsert conn from 10.244.1.5:40532"}

8. 检查确认

vmui

访问 vmui 查看指标数据,缺失了一部分,这是最后一次备份后产生的数据,符合正常预期

Grafana

Grafana 仪表盘查看,也有一个小缺口,同样符合预期


文章作者: Da
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Da !
  目录