Monitoring power state on virtual machines

We have a VMware datacenter with a cluster hosting a number of virtual hosts that are not connected to the rest of our network. As we already have a Nagios solution that monitors most of our infrastructure, we wanted to monitor at least the power state of these network-unreachable hosts.

One way of doing this might be to use the VMware Infrastructure Toolkit, but alas, the running time of such scripts has been too long (this first became a problem with our Munin monitoring). The other solution (there might be more that I do not know about) is to use SNMP.

At least ESX 3 and 3.5 ship with an SNMP agent that provides a set of VMware OIDs (enterprises.6876). From vSphere 4, VMware ships two SNMP daemons: one based on ucd-snmp without the VMware OIDs, and one built into the vmware-hostd process that provides these OIDs.

The first branch of this OID contains version info for the ESX server.

$ snmpwalk -Os -Cc -c public -v 2c kaffe enterprises.6876.1
enterprises.6876.1.1.0 = STRING: "VMware ESX"
enterprises.6876.1.2.0 = STRING: "4.0.0"
enterprises.6876.1.4.0 = STRING: "208167"
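
If you want to use this output in a script, it is easy to pick apart. Below is a minimal Python sketch (not part of my plugin; the function name parse_snmpwalk is made up for illustration) that turns snmpwalk -Os lines like the ones above into a dictionary:

```python
import re

# Matches lines such as: enterprises.6876.1.1.0 = STRING: "VMware ESX"
LINE_RE = re.compile(r'^(?P<oid>\S+) = (?P<type>[\w-]+): (?P<value>.*)$')

def parse_snmpwalk(text):
    """Return {oid: value} for each 'OID = TYPE: value' line; other lines are skipped."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Drop the surrounding double quotes that snmpwalk puts around strings.
            result[match.group("oid")] = match.group("value").strip('"')
    return result

sample = '''enterprises.6876.1.1.0 = STRING: "VMware ESX"
enterprises.6876.1.2.0 = STRING: "4.0.0"
enterprises.6876.1.4.0 = STRING: "208167"'''

info = parse_snmpwalk(sample)
print(info["enterprises.6876.1.1.0"], info["enterprises.6876.1.2.0"])
```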

The second branch contains status and info for the ESX server's virtual machines.

Using Perl and Net::SNMP, it was fairly simple to make a Nagios plugin that could query an ESX server and check whether a named virtual host was running. As we are using a cluster with DRS and HA, a virtual machine may be running on any of the nodes, so I extended the script so that it can query any number of ESX servers.

$ /usr/lib/nagios/plugins/check_vmware_status -N kamuf -N emba -N kaffe -H provis-ts
provis-ts is on (vmwaretools running) on node kaffe

As one can see, this also provides information about the current state of VMware Tools.
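
The decision logic of such a check can be sketched in a few lines. This is not the Perl plugin itself, only a rough Python illustration under some assumptions: fetch_vms is a hypothetical callable standing in for the SNMP query against one node, and treating "not found on any node" as CRITICAL is my own choice for the sketch. The exit codes are the standard Nagios plugin ones.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def check_vm(vm_name, nodes, fetch_vms):
    """Look for vm_name on each node in turn; return (exit_code, message).

    fetch_vms(node) is a hypothetical callable that returns
    {display_name: (power_state, tools_state)} for one ESX server,
    or raises OSError if that server cannot be queried.
    """
    reachable = 0
    for node in nodes:
        try:
            vms = fetch_vms(node)
        except OSError:
            continue  # this node did not answer; try the remaining ones
        reachable += 1
        if vm_name in vms:
            power, tools = vms[vm_name]
            message = "%s is %s (vmwaretools %s) on node %s" % (
                vm_name, power, tools, node)
            return (OK if power == "on" else CRITICAL), message
    if reachable == 0:
        return UNKNOWN, "no ESX server answered"
    return CRITICAL, "%s not found on any node" % vm_name

# Fake data in place of real SNMP answers, mirroring the example above.
fake = {"kamuf": {}, "emba": {}, "kaffe": {"provis-ts": ("on", "running")}}
code, message = check_vm("provis-ts", ["kamuf", "emba", "kaffe"], fake.__getitem__)
print(code, message)
```

Passing the query function in as a parameter keeps the node loop separate from the SNMP details, which also makes it possible to try the logic without a live ESX server.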

A quick guide to enabling the SNMP service in vmware-hostd:
First, log on to the ESX server with a privileged account (root).
Query the runlevel startup info for snmpd:

# chkconfig --list snmpd
snmpd 0:off 1:off 2:on 3:on 4:on 5:on 6:off

If this contains no "on" in any runlevel, you can skip to the firewall configuration.
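
If you have many ESX servers to go through, this test can be automated. A small Python sketch (the function name enabled_runlevels is hypothetical) that reads one line of chkconfig --list output and returns the runlevels where the service is on:

```python
def enabled_runlevels(chkconfig_line):
    """Return the runlevels marked 'on' in one line of `chkconfig --list` output."""
    levels = []
    for field in chkconfig_line.split()[1:]:  # skip the service name
        level, _, state = field.partition(":")
        if state == "on":
            levels.append(int(level))
    return levels

line = "snmpd 0:off 1:off 2:on 3:on 4:on 5:on 6:off"
print(enabled_runlevels(line))  # [2, 3, 4, 5]
```

An empty list means snmpd is not enabled in any runlevel.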

If snmpd is enabled, follow these two steps:

# service snmpd stop

(If it says FAILED, do not worry, but continue to the next step)

# chkconfig snmpd off

The last step to complete on the ESX server is to make sure the firewall allows incoming traffic to the SNMP service:

# esxcfg-firewall -o 161,udp,in,SNMP

If you do not have the VMware vSphere Command-Line Interface installed, download and install it on a machine that can reach the ESX servers.

(If running Ubuntu, you might want to install libuuid-perl and libclass-methodmaker-perl)

Use the vicfg-snmp command to configure and enable the service:

$ vicfg-snmp --server kaffe --username root --communities public --enable

To check, use some SNMP tool (I prefer snmpwalk):

$ snmpwalk -Os -Cc -c public -v 2c kaffe enterprises.6876.1.1
enterprises.6876.1.1.0 = STRING: "VMware ESX"

To use this in Nagios, you will have to download the check_vmware_status plugin. I prefer to place it within the Nagios plugin directory (our installation uses /usr/lib/nagios/plugins). Make sure it is executable (chmod +x).

I've added a command to the Nagios config (commands.cfg):

define command{
command_name check_vmware_status
command_line $USER1$/check_vmware_status -N kamuf -N emba -N kaffe -H $HOSTNAME$
}

The -N options list all our ESX servers (nodes, if you like) that are part of the VMware cluster.
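
To see what Nagios actually runs, it helps to expand the macros by hand. The Python sketch below imitates the substitution for the simple case only (expand_command is a made-up helper, and the value of $USER1$ is assumed to be our plugin directory). Since $HOSTNAME$ becomes the host_name of the Nagios host, that name presumably has to match the name of the virtual machine as the plugin sees it.

```python
def expand_command(command_line, macros):
    """Replace each $NAME$ macro with its value (simple cases only)."""
    for name, value in macros.items():
        command_line = command_line.replace("$%s$" % name, value)
    return command_line

template = "$USER1$/check_vmware_status -N kamuf -N emba -N kaffe -H $HOSTNAME$"
expanded = expand_command(
    template,
    {"USER1": "/usr/lib/nagios/plugins", "HOSTNAME": "provis-ts"},
)
print(expanded)
```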

I've made a host group for all the virtual machines we want to monitor and made this service config (services_nagios2.cfg):

define service{
use generic-service ; Name of service template to use
hostgroup_name virtual-machines
service_description VMWARE-ON
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups vm-admins
notification_interval 120
notification_period workhours
notification_options c,r
check_command check_vmware_status
}

An example of a host configuration:

define host{
use generic
host_name provis-ts
alias Terminal server
}

An example of a hostgroup configuration:

define hostgroup{
hostgroup_name virtual-machines
alias Virtual machines
members provis-ts, provis-dc
}