
Prometheus monitoring setup (non-Docker installation)

Further reading

https://www.gitbook.com/book/songjiayang/prometheus/details (Prometheus in Action)

https://github.com/1046102779/prometheus (unofficial Chinese Prometheus manual)

http://www.bubuko.com/infodetail-2004088.html (monitoring a k8s cluster with Prometheus)

http://www.cnblogs.com/sfnz/p/6566951.html (installing Prometheus + Grafana to monitor MySQL, Redis, Kubernetes, etc., without Docker)

http://blog.csdn.net/wenwst/article/details/76624019 (deploying Prometheus and Grafana on Kubernetes 1.6 with persistent data)

https://github.com/jason-riddle/monitor-k8s-with-prom (Prometheus monitoring on Kubernetes)

https://github.com/kayrus/prometheus-kubernetes (prometheus-kubernetes)

https://github.com/prometheus/node_exporter (prometheus/node_exporter)

http://dockone.io/article/2579 (Prometheus monitoring practice on Kubernetes)

http://www.ywnds.com/?p=9656 (monitoring MySQL with Prometheus + Grafana)

https://github.com/prometheus/prometheus/releases (Prometheus downloads)

https://github.com/prometheus/node_exporter/releases/ (node_exporter downloads)

https://laily.net/article/Prometheus%20%E5%88%9D%E4%BD%93%E9%AA%8C%281%29%20-%20%E5%AE%89%E8%A3%85 (A first look at Prometheus (1): installation)

http://blog.csdn.net/u010871982/article/details/77838592?locationNum=2&fps=1 (a simple introduction to Prometheus)

https://www.robustperception.io/scaling-and-federating-prometheus/ (Prometheus federation)

http://dbaplus.cn/news-72-1462-1.html (online service monitoring with Prometheus at 360)

 

1. Installing Prometheus

[root@localhost prometheus]# wget https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz

 

[root@localhost prometheus]# mkdir /opt/prometheus

[root@localhost prometheus]# tar -zxvf prometheus-1.7.1.linux-amd64.tar.gz -C /opt/prometheus --strip-components=1

 

[root@localhost prometheus]# cd /opt/prometheus/

[root@localhost prometheus]# cp prometheus.yml prometheus.yml.back

[root@localhost prometheus]# vim prometheus.yml     # note: YAML does not allow tab characters; indent with spaces only

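# Global configuration
global:
  scrape_interval:     15s # scrape targets every 15 seconds by default

  # These labels are attached to every time series and alert when this server
  # talks to external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Scrape configuration.
# Here Prometheus scrapes its own metrics.
scrape_configs:
  # job_name is added as the label job=<job_name> to every time series scraped from this config.
  - job_name: 'prometheus'

    # Override the global scrape interval: 5 seconds instead of 15.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']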
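Since YAML rejects tab characters, it can help to scan the edited file before starting Prometheus. The following is a minimal sketch, not part of the original walkthrough; the function name `has_tabs` is made up for illustration.

```shell
#!/usr/bin/env bash
# has_tabs FILE: print every line of FILE that contains a tab character
# (with its line number) and exit 0; exit non-zero if no tab is found.
has_tabs() {
  grep -n "$(printf '\t')" "$1"
}

# Usage, with the config path used in this walkthrough:
#   has_tabs /opt/prometheus/prometheus.yml && echo "replace the tabs above with spaces"
```

The check only looks for tabs; it does not validate the YAML structure itself.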

 

Start it:

nohup ./prometheus --config.file=prometheus.yml &

or, using absolute paths:

nohup /opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml &

Then open http://localhost:9090/ in a browser to see the Prometheus graph page.

http://www.cnblogs.com/vovlie/p/Prometheus_install.html (reference)

 

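Prometheus can reload its configuration without stopping the service, which is convenient when the config is being changed repeatedly during debugging:

#     curl -X POST http://localhost:9090/-/reload

 

#     netstat -antl | grep 9090             # check that Prometheus is listening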
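Because `nohup ... &` returns immediately, the server may not be listening yet when the next command runs. As a rough sketch (the helper name `wait_for_http` and the retry count are assumptions, not something the original post uses), one could poll the web port with curl before sending the reload request:

```shell
#!/usr/bin/env bash
# wait_for_http URL [TRIES]: poll URL about once per second until it answers
# without an HTTP error, or give up after TRIES attempts (default 10).
wait_for_http() {
  local url="$1" tries="${2:-10}" i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: quiet, -f: treat HTTP status >= 400 as failure
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_http http://localhost:9090/ 15 && curl -X POST http://localhost:9090/-/reload
```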

  

To run Prometheus as a managed service, create a systemd unit file.

A dedicated user can be created to run it:

[root@localhost config]# useradd prometheus       

[root@localhost ~]# vim /etc/systemd/system/prometheus.service

 

[Unit]

Description=Prometheus Server

Documentation=https://prometheus.io/docs/introduction/overview/


After=network.target

 

[Service]

Type=simple

User=prometheus

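# The paths below must match the Prometheus install directory and its prometheus.yml.
# systemd does not allow comments after a trailing backslash, so they are kept on their own lines.
ExecStart=/opt/prometheus/prometheus \
        -config.file=/opt/prometheus/prometheus.yml \
        -storage.local.path=/home/prometheusdata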

Restart=on-failure

 

[Install]

WantedBy=multi-user.target

 

Note: the directory given by -storage.local.path=/home/prometheusdata must be writable by the prometheus user created above.
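One way to satisfy the permission requirement is to create the directory and hand it to the service user before the first start. This is only a sketch; the helper name `prepare_data_dir` is hypothetical, and changing ownership to another user needs root.

```shell
#!/usr/bin/env bash
# prepare_data_dir DIR USER: create DIR if it does not exist and make USER its owner.
prepare_data_dir() {
  local dir="$1" user="$2"
  mkdir -p "$dir" || return 1
  chown -R "$user" "$dir"
}

# Usage (as root), with the path and user from the unit file:
#   prepare_data_dir /home/prometheusdata prometheus
```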

After saving the unit file, start the service with systemctl start prometheus

#      systemctl  enable  prometheus.service  

#      systemctl  restart  prometheus.service   

 

2. Installing Grafana

[root@localhost prometheus]# wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.5.0-1.x86_64.rpm

[root@localhost prometheus]# yum install initscripts fontconfig -y

 

[root@localhost prometheus]# rpm -Uvh grafana-4.5.0-1.x86_64.rpm   

warning: grafana-4.5.0-1.x86_64.rpm: Header V4 RSA/SHA1 Signature, key ID 24098cb6: NOKEY

error: Failed dependencies:

urw-fonts is needed by grafana-4.5.0-1.x86_64

    The rpm install fails on a missing dependency, so install with yum instead, which resolves dependencies:

[root@localhost prometheus]# yum localinstall grafana-4.5.0-1.x86_64.rpm

 

[root@localhost prometheus]# service grafana-server start        # start the service

Starting grafana-server (via systemctl):                   [  OK  ]

[root@localhost prometheus]#  netstat  -anp|grep  3000

Port 3000 is now listening.

The web UI is at http://localhost:3000; the default username/password is admin/admin.

 

http://docs.grafana.org/installation/rpm/ (official Grafana documentation)

Grafana can be set up as a system service:

# mkdir -p /var/run/grafana

# chown grafana:grafana /var/run/grafana

# vim /etc/sysconfig/grafana-server

and add: PID_FILE_DIR=/var/run/grafana

 

# vim /etc/systemd/system/grafana.service

[Unit]

Description=Grafana Services

Documentation=https://github.com/grafana/grafana

After=network.target

 

[Service]

EnvironmentFile=/etc/sysconfig/grafana-server

User=grafana

Group=grafana

Type=simple

WorkingDirectory=/usr/share/grafana

RuntimeDirectory=grafana

RuntimeDirectoryMode=0750

ExecStart=/usr/sbin/grafana-server \
        --config=${CONF_FILE} \
        --pidfile=${PID_FILE_DIR}/grafana-server.pid \
        cfg:default.paths.logs=${LOG_DIR} \
        cfg:default.paths.data=${DATA_DIR} \
        cfg:default.paths.plugins=${PLUGINS_DIR}

LimitNOFILE=10000

TimeoutStopSec=20

UMask=0027

 

[Install]

WantedBy=multi-user.target

 

# Variables such as ${CONF_FILE} in the unit file above are read from /etc/sysconfig/grafana-server

 

# After changing a unit file, reload systemd first

# systemctl  daemon-reload  

# systemctl  restart grafana.service  

 

# systemctl  enable  grafana.service   

 

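To connect Grafana to Prometheus, see:

https://prometheus.io/docs/visualization/grafana/ (Prometheus documentation on Grafana integration)

 

    Replacing Grafana's dashboards

Grafana ships with few ready-made dashboard templates; apart from the ones open-sourced by Percona, most have to be built by hand.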

[root@localhost prometheus]# yum install git -y

[root@localhost prometheus]# git clone https://github.com/percona/grafana-dashboards.git

Cloning into 'grafana-dashboards'...

remote: Counting objects: 1308, done.

remote: Compressing objects: 100% (31/31), done.

remote: Total 1308 (delta 32), reused 40 (delta 21), pack-reused 1256

Receiving objects: 100% (1308/1308), 6.39 MiB | 1.67 MiB/s, done.

Resolving deltas: 100% (982/982), done.

 

[root@localhost prometheus]# cp -r grafana-dashboards/dashboards /var/lib/grafana/

[root@localhost prometheus]# vim /etc/grafana/grafana.ini

Edit it as follows:

[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards

 

 [root@localhost prometheus]# service grafana-server restart

 Or restart with:

[root@localhost prometheus]# systemctl restart grafana-server

 

 

 3. Installing node_exporter

[root@localhost prometheus]# wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz

 

[root@localhost prometheus]# tar -zxvf node_exporter-0.14.0.linux-amd64.tar.gz

[root@localhost local]# mv /home/prometheus/node_exporter-0.14.0.linux-amd64 ./node_exporter-0.14.0

[root@localhost local]# cd node_exporter-0.14.0/

 

[root@localhost node_exporter-0.14.0]# nohup ./node_exporter &

Check that the process is running:

[root@localhost node_exporter-0.14.0]# ps -ef|grep node_exporter

root     24760 24106  0 14:39 pts/1    00:00:00 ./node_exporter

root     24766 24106  0 14:39 pts/1    00:00:00 grep --color=auto node_exporter

 

 node_exporter can also be run as a service:

[root@localhost ~]# vim /etc/systemd/system/node_exporter.service

A systemd unit for node_exporter:

 

[Unit]


Description=Prometheus node exporter

After=local-fs.target network-online.target network.target

Wants=local-fs.target network-online.target network.target

 

[Service]

Type=simple

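# Run as the prometheus user (systemd does not allow a comment on the same line as a setting)
User=prometheus

# Adjust this path to wherever the node_exporter binary was unpacked
ExecStart=/usr/local/prometheus/node_exporter/node_exporter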

Restart=on-failure

 

[Install]

WantedBy=multi-user.target

#     systemctl      enable   node_exporter.service 

 

#     systemctl      restart   node_exporter.service  
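Prometheus only collects these host metrics once node_exporter is listed as a scrape target. As a sketch, the snippet below writes a scrape job to a scratch file for review. It assumes node_exporter listens on its default port 9100 on the same host; the job name `node` is chosen to match the `job="node"` selector used in the alert rules, and the scratch file name is made up for illustration.

```shell
#!/usr/bin/env bash
# Write a node_exporter scrape job to a scratch file so it can be reviewed
# and then pasted under scrape_configs: in prometheus.yml.
# Assumptions: default node_exporter port 9100, job name "node".
out="${TMPDIR:-/tmp}/node_scrape_job.yml"
cat > "$out" <<'EOF'
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF
echo "Append the contents of $out under scrape_configs: in prometheus.yml, then reload Prometheus"
```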

 

 

 4. Installing Alertmanager

http://blog.csdn.net/y_xiao_/article/details/50818451

http://www.jianshu.com/p/239b145e2acc (Prometheus Alertmanager alerting component)

https://github.com/prometheus/alertmanager (Alertmanager on GitHub)

https://prometheus.io/blog/2016/03/03/custom-alertmanager-templates/ (custom Alertmanager templates)

https://www.robustperception.io/sending-alert-notifications-to-multiple-destinations/ (sending alert notifications to multiple destinations)

https://prometheus.io/webtools/alerting/routing-tree-editor/ (routing tree editor)

 

 [root@localhost prometheus]# wget https://github.com/prometheus/alertmanager/releases/download/v0.9.1/alertmanager-0.9.1.linux-amd64.tar.gz

 [root@localhost prometheus]# tar -zxvf alertmanager-0.9.1.linux-amd64.tar.gz

 [root@localhost prometheus]# mv alertmanager-0.9.1.linux-amd64 /opt/alertmanager

 [root@localhost prometheus]# cd /opt/alertmanager

 [root@localhost prometheus]# nohup ./alertmanager -config.file=simple.yml &

 

  Restart Prometheus, pointing it at Alertmanager:

# ./prometheus -config.file=prometheus.yml  -alertmanager.url http://localhost:9093

 

  Alertmanager can also reload its configuration without a restart:

# curl -XPOST http://localhost:9093/-/reload

Set up Alertmanager as a system service:

# vim /etc/systemd/system/alertmanager.service

[Unit]    

Description=Prometheus  Alertmanager.    

Documentation=https://github.com/prometheus/alertmanager  

After=network.target

      

[Service]      

EnvironmentFile=-/etc/alertmanager/template

User=root    

ExecStart=/opt/alertmanager/alertmanager \
        -config.file=/opt/alertmanager/simple.yml \
        -storage.path=/home/alertmanager \
        $ALERTMANAGER_OPTS

ExecReload=/bin/kill -HUP $MAINPID

Restart=on-failure     

      

[Install] 

WantedBy=multi-user.target

 

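Finally, run:

#     systemctl  enable  alertmanager.service

 

#     systemctl  restart  alertmanager.service

 

Open the Alertmanager UI at http://ip:9093/#/alerts

 

Configuring Alertmanager

 

 Alerting has two parts. The alert conditions live in a rules file, kept here in the Prometheus install directory and named alert.rules. The notification details, such as mail addresses and recipients, are set in simple.yml in the Alertmanager install directory. Below are some basic, commonly used rules; adjust the thresholds and durations to your needs.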
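Prometheus does not pick up a rules file just because it sits in the install directory; the file has to be listed under `rule_files` in prometheus.yml. As a sketch, the snippet below writes that fragment to a scratch file for review. It assumes the install directory /opt/prometheus used earlier; the scratch file name is made up for illustration.

```shell
#!/usr/bin/env bash
# Write the rule_files fragment that tells Prometheus to load alert.rules.
# rule_files is a top-level key in prometheus.yml, next to global: and scrape_configs:.
# Assumption: the rules file lives at /opt/prometheus/alert.rules.
out="${TMPDIR:-/tmp}/rule_files_fragment.yml"
cat > "$out" <<'EOF'
rule_files:
  - '/opt/prometheus/alert.rules'
EOF
echo "Merge the contents of $out into prometheus.yml, then reload Prometheus"
```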

 

#alert.rules:

 

ALERT node_down  

  IF up{job="node"} == 0

  FOR 5m    

  ANNOTATIONS {

    summary = "Node is down",  

    description = "Node has been unreachable for more than 5 minutes.",

    severity = "warning"  

  }  

 

ALERT snmp_down  

  IF up{job="snmp"} == 0

  FOR 5m

  ANNOTATIONS {

    summary = "SNMP is down",  

    description = "SNMP has been unreachable for more than 5 minutes.",

    severity = "warning"  

  }  

 

ALERT fs_at_80_percent  

  IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.8

  FOR 15m  

  ANNOTATIONS {  

    summary = "File system {{$labels.hrStorageDescr}} is at 80%",  

    description = "{{$labels.hrStorageDescr}} has been at 80% for more than 15 Minutes.",

    severity = "warning"  

  }  

 

ALERT fs_at_90_percent  

  IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.9  

  FOR 15m 

  ANNOTATIONS {  

    summary = "File system {{$labels.hrStorageDescr}} is at 90%",  

    description = "{{$labels.hrStorageDescr}} has been at 90% for more than 15 Minutes.",

    severity = "average"  

  }  

 

ALERT disk_load_mostly_random_reads  

  IF rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND

  rate(diskIONReadX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) < 10000 

  FOR 15m  

  ANNOTATIONS {

    summary = "Disk {{$labels.diskIODevice}} reads are mostly random.",

    description = "{{$labels.diskIODevice}} reads have been mostly random for the past 15 Minutes.",

    severity = "info"  

  }  

 

ALERT disk_load_mostly_random_writes  

  IF rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND

  rate(diskIONWrittenX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) < 10000  

  FOR 15m

  ANNOTATIONS {  

    summary = "Disk {{$labels.diskIODevice}} writes are mostly random.",  

    description = "{{$labels.diskIODevice}} writes have been mostly random for the past 15 Minutes.",

    severity = "info"  

  }  

 

ALERT disk_load_high  

  IF diskIOLA1{diskIODevice=~"(s|v)d[a-z]+"} > 30

  FOR 15m  

  ANNOTATIONS {  

    summary = "Disk {{$labels.diskIODevice}} is at 30%",  

    description = "{{$labels.diskIODevice}} Load has exceeded 30% over the past 15 Minutes.",  

    severity = "warning"  

  }  

 

ALERT cpu_load_high

  IF ssCpuIdle < 70  

  FOR 15m

  ANNOTATIONS {

    summary = "CPU is at 30%",  

    description = "CPU Load has constantly exceeded 30% over the past 15 Minutes.",

    severity = "warning"  

  }  

 

ALERT linux_load_high  

  IF laLoad1 > 50

  FOR 15m  

  ANNOTATIONS {  

    summary = "Linux Load is at 50",  

    description = "Linux Load has constantly exceeded 50 over the past 15 Minutes.",

    severity = "average"  

  }  

 

ALERT if_operstatus_changed  

  IF delta(ifOperStatus[15m]) != 0  

  ANNOTATIONS {     

    summary = "Port {{$labels.ifDescr}} changed status",  

    description = "Port {{$labels.ifDescr}} went up or down in the past 15 Minutes",

    severity = "info"  

  }  

 

ALERT if_traffic_at_30_percent

  IF ifSpeed > 10000000 AND

  ifOperStatus == 1 AND  

    rate(ifInOctets[5m]) > ifSpeed * 0.3  

  FOR 15m  

  ANNOTATIONS {  

    summary = "Port {{$labels.ifDescr}} is at 30%",  

    description = "Port {{$labels.ifDescr}} has had at least 30% traffic over the past 15 Minutes.",  

    severity = "warning"  

  }  

 

ALERT if_traffic_at_70_percent

  IF ifSpeed > 10000000 AND

  ifOperStatus == 1 AND  rate(ifInOctets[5m]) > ifSpeed * 0.7

  FOR 15m  

  ANNOTATIONS {  

    summary = "Port {{$labels.ifDescr}} is at 70%",  

    description = "Port {{$labels.ifDescr}} has had at least 70% traffic over the past 15 Minutes.",  

    severity = "average"  

  }

 

# CPU alert

ALERT cpu_overload

  IF node_load1 >= 0.8

  FOR 3m

  LABELS { severity = "all" }

  ANNOTATIONS {

    summary = "Instance {{ $labels.instance }} cpu_load1 over 80% for 3 minutes",

    description = "{{ $labels.instance }} of job {{ $labels.job }} cpu_load1 over 80% for 3 minutes.",

  }

 

# Memory alert

ALERT memory_overload

  IF (node_memory_MemTotal-node_memory_MemFree)/node_memory_MemTotal >= 0.8

  FOR 3m

  LABELS { severity = "all" }

  ANNOTATIONS {

    summary = "Instance {{ $labels.instance }} memory_load over 80% for 3 minutes",

    description = "{{ $labels.instance }} of job {{ $labels.job }} memory_load over 80% for 3 minutes.",

  }
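The ALERT / IF / FOR syntax used in these rules is the Prometheus 1.x rule format, which matches the 1.7.1 release installed in this walkthrough. Prometheus 2.0 and later read rules from YAML files instead. As a rough sketch of what the node_down rule might look like in the newer format (the file name and group name are made up for illustration, and severity is shown as a label so that Alertmanager can match on it):

```shell
#!/usr/bin/env bash
# Write a Prometheus 2.x style rule file equivalent to the node_down rule.
# Assumptions: scratch file name and group name "node" are illustrative only.
out="${TMPDIR:-/tmp}/alert.rules.yml"
cat > "$out" <<'EOF'
groups:
  - name: node
    rules:
      - alert: node_down
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node is down"
          description: "Node has been unreachable for more than 5 minutes."
EOF
echo "Wrote $out"
```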

---------------------------------------------------

#     simple.yml

The file has three main parts: global sets the outgoing mail server, route sets the routing rules and notification intervals, and receivers sets the recipients.

global:
  # Sender address and SMTP settings.
  # Alertmanager expects smtp_smarthost as host:port; add your mail server's port.
  smtp_smarthost: 'smtp.abc.com'
  smtp_from: 'prometheus@abc.com'
  smtp_auth_username: 'prometheus'
  smtp_auth_password: 'abcd'

route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 6h

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    equal: ['alertname']

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'support@abc.com'
        send_resolved: true

# After editing, reload the configuration file

 

 

 

5. Installing and configuring cAdvisor

docker run -d --restart=always \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8090:8080 \
  --name=cadvisor \
  google/cadvisor:latest

 

 cAdvisor is then reachable in a browser at http://ip:8090

# Alert rules for monitoring cAdvisor:

# vim containers.rules  

 

ALERT cAdvisor_down  

  IF absent(container_memory_usage_bytes{name="cadvisor"})  

  FOR 1m  

  LABELS { severity = "critical" }  

  ANNOTATIONS {  

   summary= "cAdvisor containers down",  

   description= "cAdvisor container is down for more than 1 minute."  

  }  

 

ALERT cAdvisor_high_cpu  

  IF sum(rate(container_cpu_usage_seconds_total{name="cadvisor"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10  

  FOR 5m  

  LABELS { severity = "warning" }  

  ANNOTATIONS {  

    summary= "cAdvisor high CPU usage",  

    description= "cAdvisor CPU usage is {{ humanize $value}}%."  

  }  

 

ALERT cAdvisor_high_memory  

  IF sum(container_memory_usage_bytes{name="cadvisor"}) > 1200000000

  FOR 5m

  LABELS { severity = "warning" }

  ANNOTATIONS {  

      summary = "cAdvisor high memory usage",  

      description = "cAdvisor memory consumption is at {{ humanize $value}}.",  

  }

 

 

 

 

 

 

 
