Monitoring server cluster status with Ganglia

brandNewUser


Basic architecture of Ganglia

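Ganglia consists of three components: gmond, gmetad, and gweb.

gmond (Ganglia Monitoring Daemon) is a lightweight service installed on every host whose metrics you want to collect. gmond does the actual metric collection on each host and shares the data with the other nodes in the cluster through a listen/announce protocol. With gmond you can easily collect many system metrics, such as CPU, memory, disk, network, and active-process data.

gmetad (Ganglia Meta Daemon) is a service that collects metrics from other gmetad or gmond sources and stores them on disk in RRD format. gmetad provides a simple query mechanism for specific metrics collected from groups of hosts, and supports hierarchical delegation, which makes it possible to build federated monitoring domains. It is the component that actually polls the gmond clusters and writes the metrics to disk.

gweb (Ganglia Web) is a PHP front end that displays the data stored by gmetad in a browser. Its web interface charts the many metrics collected from the running cluster.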
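Both gmetad and gweb ultimately consume XML: with a default gmond.conf, gmond's tcp_accept_channel (port 8649) writes its view of the cluster as an XML document to any TCP client and then closes the connection, and gmetad serves its aggregated data the same way on its xml_port (8651 by default). The following is a minimal Python sketch, assuming that default gmond port, for pulling a per-host metric table out of such a dump. The SAMPLE document is a hand-written, abbreviated example of the format (host name, IP, and values are made up), not captured output.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host: str = "localhost", port: int = 8649, timeout: float = 5.0) -> bytes:
    """Read the XML document gmond writes to a TCP client before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def host_metrics(xml_bytes: bytes) -> dict:
    """Map each HOST name to a {metric NAME: VAL} dict (values stay strings)."""
    root = ET.fromstring(xml_bytes)
    return {
        host.get("NAME"): {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        for host in root.iter("HOST")
    }


# Hand-written, abbreviated sample in gmond's XML layout (not real output).
SAMPLE = b"""<?xml version="1.0"?>
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
 <CLUSTER NAME="Solr" OWNER="unspecified">
  <HOST NAME="solr1" IP="192.168.1.10">
   <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=" "/>
   <METRIC NAME="cpu_num" VAL="8" TYPE="uint16" UNITS="CPUs"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""

if __name__ == "__main__":
    print(host_metrics(SAMPLE))
```

On a real node, host_metrics(fetch_gmond_xml()) returns the same shape for every host gmond has heard from; the TYPE attribute of each METRIC says how to convert the string values.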

Installing and configuring Ganglia

Ganglia can be installed directly with yum:

yum install ganglia-gmond -y
yum install ganglia-gmetad -y
yum install httpd php

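ganglia-web only needs to be installed on the master node: download the corresponding tarball, edit the Makefile (set GDESTDIR as below), run make install to deploy it under Apache, and start the httpd process.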

GDESTDIR = /var/www/html/ganglia2

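Likewise, gmetad does not need to run on every node, but every node needs a gmond process, which collects that node's data for gmetad.

When started this way, all hosts belong to a single group (unspecified). If, for example, we have a test environment and a Solr environment and want to track their load separately, the configuration has to change: on the gmetad host, edit the data_source section of /etc/ganglia/gmetad.conf: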

data_source "Test" localhost 192.168.1.xx:8650
data_source "Solr" 192.168.1.2xx:8649

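Each line defines a data source; the addresses after the group name point to that group's head node, which does not need to run gmetad. Each group needs its own port, because gmond nodes exchange their data over multicast, and nodes that share the same multicast address and port end up in the same cluster.

The client side is configured in /etc/ganglia/gmond.conf, which is comparatively more involved; set the cluster name and port according to the group each node belongs to: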
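Since the group-to-port mapping lives entirely in these lines, it can be handy to read them back, for example to check that no two groups share a port. The following is a minimal Python sketch of a parser for the data_source "name" [interval] host[:port] ... syntax, assuming gmetad's documented behaviour that a missing port means 8649 and that several addresses on one line are redundant sources for the same cluster, tried in order. DataSource and parse_data_sources are hypothetical names.

```python
import shlex
from typing import List, NamedTuple, Optional, Tuple


class DataSource(NamedTuple):
    name: str
    interval: Optional[int]       # polling interval in seconds, None = gmetad default
    hosts: List[Tuple[str, int]]  # redundant (address, port) pairs, tried in order


def parse_data_sources(conf_text: str, default_port: int = 8649) -> List[DataSource]:
    """Parse the data_source lines of a gmetad.conf (IPv4 addresses / host names only)."""
    sources = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line.startswith("data_source"):
            continue  # skips comments, blank lines and every other directive
        tokens = shlex.split(line, comments=True)  # handles the quoted group name
        name, rest = tokens[1], tokens[2:]
        interval = int(rest.pop(0)) if rest and rest[0].isdigit() else None
        hosts = []
        for token in rest:
            addr, _, port = token.partition(":")
            hosts.append((addr, int(port) if port else default_port))
        sources.append(DataSource(name, interval, hosts))
    return sources


if __name__ == "__main__":
    conf = 'data_source "Test" localhost 192.168.1.xx:8650\ndata_source "Solr" 192.168.1.2xx:8649\n'
    for src in parse_data_sources(conf):
        print(src)
```

For the two lines above this yields Test with [("localhost", 8649), ("192.168.1.xx", 8650)] and Solr with [("192.168.1.2xx", 8649)].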

cluster {
  name = "Solr"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
 
/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}
 
/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}
 
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}
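Because only the cluster name and the port change from group to group, the per-group fragment can be generated rather than hand-edited on each node. The following is a minimal Python sketch, assuming every group keeps the multicast address 239.2.11.71 from the configuration above and uses one port both for its UDP channels and for the tcp_accept_channel that gmetad polls (the port in the group's data_source line). render_gmond_group is a hypothetical helper; it emits only these sections, so the rest of gmond.conf (globals, host, modules, collection groups) stays as installed.

```python
def render_gmond_group(cluster: str, port: int, mcast: str = "239.2.11.71") -> str:
    """Return the cluster/channel sections of gmond.conf for one group."""
    # Doubled braces are literal braces in the f-string.
    return f"""cluster {{
  name = "{cluster}"
  owner = "unspecified"
}}

udp_send_channel {{
  mcast_join = {mcast}
  port = {port}
  ttl = 1
}}

udp_recv_channel {{
  mcast_join = {mcast}
  port = {port}
  bind = {mcast}
}}

/* gmetad polls this port: it must match the group's data_source line */
tcp_accept_channel {{
  port = {port}
}}
"""


if __name__ == "__main__":
    # Ports taken from the data_source example above.
    for group, port in {"Test": 8650, "Solr": 8649}.items():
        print(f"# ---- gmond.conf sections for group {group} ----")
        print(render_gmond_group(group, port))
```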

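The Linux release in use can be checked with cat /etc/issue (here: CentOS release 6.5). Installing software with yum requires the machine to reach the internet, which our servers cannot always do (public IPs are limited, must be purchased, and are not cheap). So we configure a static public IP on the NIC, bring it up only when needed, and switch it off again afterwards.

Edit the NIC configuration file /etc/sysconfig/network-scripts/ifcfg-eth0: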

DEVICE=eth0
HWADDR=00:50:56:9x:4x:A0
TYPE=Ethernet
UUID=49cb486a-1d12-44d7-b7c1-b012xx51823f
ONBOOT=no
NM_CONTROLLED=yes
BOOTPROTO=static
DNS1=192.168.1.1
IPADDR=xxx
GATEWAY=xxx
NETMASK=255.255.255.240

Set IPADDR, GATEWAY, and NETMASK accordingly. With DEVICE=eth0, bring the interface up with:

>ifup eth0
Determining if ip address xxx is already in use for device eth0...

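The system checks whether the IP is already in use and then enables it. Once it is up, ifconfig shows the public IP; when you no longer need it, bring the interface down again with ifdown eth0.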
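A wrong NETMASK or GATEWAY usually only surfaces as a failed ifup, so the values can be sanity-checked first. The following is a minimal Python sketch using the standard ipaddress module; parse_ifcfg and check_static_ip are hypothetical helpers, and the addresses are placeholders from the 203.0.113.0/24 documentation range because the real ones are masked in the file above. With NETMASK=255.255.255.240 the subnet is a /28: 16 addresses, 14 of them usable for hosts, so the gateway must fall inside that small block.

```python
import ipaddress


def parse_ifcfg(text: str) -> dict:
    """Read KEY=VALUE lines of an ifcfg-* file, ignoring blanks and comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip().strip('"')
    return cfg


def check_static_ip(cfg: dict) -> ipaddress.IPv4Network:
    """Return the subnet implied by IPADDR/NETMASK; fail if GATEWAY lies outside it."""
    iface = ipaddress.IPv4Interface(f"{cfg['IPADDR']}/{cfg['NETMASK']}")
    gateway = ipaddress.IPv4Address(cfg["GATEWAY"])
    if gateway not in iface.network:
        raise ValueError(f"GATEWAY {gateway} is outside {iface.network}")
    return iface.network


if __name__ == "__main__":
    sample = "IPADDR=203.0.113.20\nGATEWAY=203.0.113.17\nNETMASK=255.255.255.240\n"
    net = check_static_ip(parse_ifcfg(sample))
    print(net, net.num_addresses - 2, "usable hosts")  # 203.0.113.16/28 14 usable hosts
```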

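If nothing is configured, Ganglia uses eth0 by default, and eth0 is usually configured with the public address. To use the internal NIC instead (for example eth1), run the following command before starting the gmetad and gmond processes:

ip route add 239.2.11.71 dev eth1

Here 239.2.11.71 is the multicast address, which is set in gmond.conf. After making this change, restart the gmond service for it to take effect.

Once grouping is done, the Ganglia home page lists all groups, along with their physical view (static information such as each server's CPU and memory configuration):

The home page also lists the servers of each group, which makes it easier to monitor what is characteristic of each group. For example, nginx is heavy on CPU cores and network bandwidth, metaq on disk, and Solr on CPU, so monitoring and alerting can be tailored to each type of server:

A full screenshot of the home page:

Clearing Ganglia data: Ganglia's runtime data lives on the gmetad node under /var/lib/ganglia/rrds and /var/lib/ganglia-web/dwoo/cache (the defaults). Delete the files in these directories and restart the gmetad process.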
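The cleanup can be scripted. The following is a minimal Python sketch, assuming the default paths above; clear_ganglia_data is a hypothetical helper. It empties each directory but keeps the directory itself, so the ownership and permissions that gmetad needs to write new RRD files are preserved. Stopping gmetad before running it avoids gmetad writing new files during the cleanup; start it again afterwards (on CentOS 6, for example, with service gmetad start).

```python
import shutil
from pathlib import Path

# Default locations from the text; adjust if gmetad.conf sets another rrd_rootdir.
DEFAULT_DIRS = ("/var/lib/ganglia/rrds", "/var/lib/ganglia-web/dwoo/cache")


def clear_ganglia_data(dirs=DEFAULT_DIRS) -> int:
    """Delete the contents of each directory, keeping the directory itself.

    Returns the number of top-level entries removed; missing directories are skipped.
    """
    removed = 0
    for directory in map(Path, dirs):
        if not directory.is_dir():
            continue
        for child in directory.iterdir():
            if child.is_dir() and not child.is_symlink():
                shutil.rmtree(child)  # e.g. one sub-directory per cluster under rrds
            else:
                child.unlink()
            removed += 1
    return removed
```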

Monitoring the JVM

With jmxtrans, metrics from inside a Java process can be monitored:

From the JMXTrans website:

jmxtrans is a tool which allows you to connect to any number of Java Virtual Machines (JVMs) and query them for their attributes without writing a single line of Java code. The attributes are exported from the JVM via Java Management Extensions (JMX). Most Java applications have made their statistics available via this protocol and it is possible to add this to any codebase without a lot of effort. If you use the SpringFramework for your code, it can be as easy as just adding a couple of annotations to a Java class file.

jmxtrans can be installed by downloading the rpm package from the official site and installing it:

sudo rpm -ivh jmxtrans-287e3ce6fe-0.noarch.rpm
Preparing...                ########################################### [100%]
   1:jmxtrans               ########################################### [100%]

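After installation jmxtrans is not running yet; first add a configuration file, /var/lib/jmxtrans/solr.conf, which defines the JVM JMX attributes to collect. The node monitored here is Solr: host must be the name set with -Djava.rmi.server.hostname, and port must be the port set with -Dcom.sun.management.jmxremote.port: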

{
  "servers" : [
    {
      "host" : "127.0.0.1",
      "alias" : "solr",
      "port" : "3000",
      "queries" : [
        {
          "obj" : "java.lang:type=Memory",
          "resultAlias": "solr1.heap",
          "attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ],
          "outputWriters" : [
            {
              "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
              "settings" : {
                "groupName" : "Product-JVM",
                "host" : "239.2.11.71",
                "port" : "8649"
              }
            }]
        },
        {
          "obj" : "java.lang:name=CMS Old Gen,type=MemoryPool",
          "resultAlias": "solr1.cmsoldgen",
          "attr" : [ "Usage" ],
          "outputWriters" : [
            {
              "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
              "settings" : {
                "groupName" : "Product-JVM",
                "host" : "239.2.11.71",
                "port" : "8649"
              }
            }]
        },
        {
          "obj" : "java.lang:type=GarbageCollector,name=*",
          "resultAlias": "solr1.gc",
          "attr" : [ "CollectionCount", "CollectionTime" ],
          "outputWriters" : [
            {
              "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
              "settings" : {
                "groupName" : "Product-JVM",
                "host" : "239.2.11.71",
                "port" : "8649"
              }
            }]
        },
        {
          "obj" : "java.lang:type=Threading",
          "resultAlias": "solr1.threads",
          "attr" : [ "DaemonThreadCount", "PeakThreadCount", "ThreadCount", "TotalStartedThreadCount" ],
          "outputWriters" : [
            {
              "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
              "settings" : {
                "groupName" : "Product-JVM",
                "host" : "239.2.11.71",
                "port" : "8649"
              }
            }]
        },
        {
          "obj" : "solr/collection1:type=queryResultCache,id=org.apache.solr.search.LRUCache",
          "resultAlias": "solr1.queryCache",
          "attr" : [ "warmupTime","size","lookups","evictions","hits","hitratio","inserts","cumulative_lookups",
                     "cumulative_hits","cumulative_hitratio","cumulative_inserts","cumulative_evictions" ],
          "outputWriters" : [
            {
              "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
              "settings" : {
                "groupName" : "Product-JVM",
                "host" : "239.2.11.71",
                "port" : "8649"
              }
            }]
        },
        {
          "obj" : "solr/collection1:type=searcher,id=org.apache.solr.search.SolrIndexSearcher",
          "resultAlias": "solr1.searcher",
          "attr" : [ "maxDoc","numDocs","warmupTime" ],
          "outputWriters" : [
            {
              "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
              "settings" : {
                "groupName" : "Product-JVM",
                "host" : "239.2.11.71",
                "port" : "8649"
              }
            }]
        }]
    }]
}

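For other attributes, consult the JMX MBeans exposed by the application and configure the corresponding names. In the GangliaWriter settings, host and port must match the Ganglia multicast address and the group's port.

Once configured, jmxtrans can be started directly, but startup expects jps to be installed at a specific location: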
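Every query above repeats the same GangliaWriter block, so when many MBeans are monitored it can be less error-prone to generate the file. The following is a minimal Python sketch, assuming the servers, queries, outputWriters structure shown above is all jmxtrans needs; ganglia_writer, build_config, and QUERIES are hypothetical names, and only the four java.lang queries are listed (the Solr MBeans can be appended the same way). Generating the file with json.dumps also guarantees valid JSON (straight quotes, no stray commas).

```python
import json


def ganglia_writer(host="239.2.11.71", port="8649", group="Product-JVM"):
    """The GangliaWriter block repeated in every query of solr.conf."""
    return {
        "@class": "com.googlecode.jmxtrans.model.output.GangliaWriter",
        "settings": {"groupName": group, "host": host, "port": port},
    }


def build_config(jmx_host, jmx_port, alias, queries):
    """Assemble servers -> queries -> outputWriters from (obj, resultAlias, attrs) tuples."""
    writer = ganglia_writer()
    return {
        "servers": [{
            "host": jmx_host,
            "alias": alias,
            "port": jmx_port,
            "queries": [
                {"obj": obj, "resultAlias": result_alias, "attr": attrs,
                 "outputWriters": [writer]}
                for obj, result_alias, attrs in queries
            ],
        }]
    }


QUERIES = [
    ("java.lang:type=Memory", "solr1.heap", ["HeapMemoryUsage", "NonHeapMemoryUsage"]),
    ("java.lang:name=CMS Old Gen,type=MemoryPool", "solr1.cmsoldgen", ["Usage"]),
    ("java.lang:type=GarbageCollector,name=*", "solr1.gc", ["CollectionCount", "CollectionTime"]),
    ("java.lang:type=Threading", "solr1.threads",
     ["DaemonThreadCount", "PeakThreadCount", "ThreadCount", "TotalStartedThreadCount"]),
]

if __name__ == "__main__":
    # Redirect to the jmxtrans config directory, e.g. > /var/lib/jmxtrans/solr.conf
    print(json.dumps(build_config("127.0.0.1", "3000", "solr", QUERIES), indent=2))
```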

sudo /etc/init.d/jmxtrans start
Starting jmxtrans: Cannot execute /usr/bin/jps -l!
                                                           [FAILED]

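If startup fails this way, edit /usr/share/jmxtrans/jmxtrans.sh to point to the directory where jps is actually installed.

Once data is being sent, the corresponding metric group shows up for that server. Below are the product heap JMX attributes configured above:

Useful Ganglia tools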

For more user-contributed modules, see https://github.com/ganglia/gmond_python_modules

ganglia-alert: reads gmetad data and raises alerts, https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-alert

ganglia-docker: run Ganglia in Docker, https://github.com/ganglia/ganglia_contrib/tree/master/docker

gmetad-health-check: monitors the gmetad service and restarts it if it goes down, https://github.com/ganglia/ganglia_contrib/tree/master/gmetad_health_checker

chef-ganglia: deploy Ganglia with Chef, https://github.com/ganglia/chef-ganglia

ansible-ganglia: automated Ganglia deployment with Ansible, https://github.com/remysaissy/ansible-ganglia

ganglia-nagios: integrates Nagios with Ganglia, https://github.com/ganglia/ganglios

ganglia-api: exposes a REST API that returns the data collected by gmetad in a specific format, https://github.com/guardian/ganglia-api
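The modules in gmond_python_modules follow gmond's Python module interface: gmond imports the module, calls metric_init(params) once with the parameters from the module's .pyconf file, expects a list of metric descriptor dictionaries back, calls each descriptor's call_back with the metric name whenever the metric is collected, and calls metric_cleanup() at shutdown. The following is a minimal sketch of such a module that reports free space on one path (relevant for disk-heavy servers such as metaq); the metric name disk_free_percent_custom and the path parameter are hypothetical, and it assumes gmond was installed with Python module support. gmond loads it through a .pyconf file that names the module (matching the .py file name), sets language = "python", passes param blocks, and lists the metric in a collection_group; the module and .pyconf directories vary by package.

```python
import shutil

_path = "/"  # overwritten from the .pyconf "path" param in metric_init


def disk_free_percent(name):
    """Callback: gmond passes the metric name and expects the current value."""
    usage = shutil.disk_usage(_path)
    return usage.free * 100.0 / usage.total


def metric_init(params):
    """Called once by gmond with the module's param values; returns descriptors."""
    global _path
    _path = params.get("path", "/")
    return [{
        "name": "disk_free_percent_custom",  # hypothetical metric name
        "call_back": disk_free_percent,
        "time_max": 90,
        "value_type": "float",
        "units": "%",
        "slope": "both",
        "format": "%.1f",
        "description": "Free space on the configured path",
        "groups": "disk",
    }]


def metric_cleanup():
    """Called by gmond at shutdown; nothing to release here."""


if __name__ == "__main__":
    # Local smoke test without gmond.
    for d in metric_init({"path": "/"}):
        print(d["name"], d["call_back"](d["name"]))
```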

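Implementing your own metrics: Metrics

Metrics is a Java library that provides a range of tools for recording system metrics. It is essentially the best replacement for a home-grown MetricMBean: it is powerful, supports many common components such as Jetty, Ehcache, and Log4j, and can send data to Ganglia. Had I found it earlier, I probably would not have written my own solution. It also has Clojure bindings, so for Clojure applications it is even more worth considering.

Ganglia login authentication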

http://blog.163.com/digoal@126/blog/static/163877040201481552751497/

On Ganglia login authentication: https://yq.aliyun.com/articles/9032?spm=5176.100240.searchblog.15.ujMOE1