Rewrite the VMware monitoring

check_vmware_esx.pl – a rewrite
Martin Fürstenau
Océ Printing Systems GmbH & Co. KG
About me
● Senior System Engineer at Océ Printing Systems GmbH & Co. KG in Poing near Munich
● 27 years in IT, 24 years of Unix, 19 years of Linux, 9 years at Océ
● Current focus
● Administration of about 120 Linux machines (CentOS and Red Hat)
● Monitoring (operations, development of plugins and add-ons, further development of the platform)
Plugins, extensions and me
● Development of plugins and extensions for
● Brocade FC switches
● Network interfaces for Solaris, Linux, Windows and NetApp
● Network card failover (Windows, Linux and Solaris)
● Processing of alarms that arrive by email
● Integration with the helpdesk system (TOPdesk)
● Cisco switches and WLCs
● CPU usage, memory (Linux, Windows, Solaris, VMware, McAfee Web Gateway)
● RAID controllers
● URLs (also via a proxy with authentication)
● LUNs on Windows (NetApp)
● VMware monitoring
Océ European Data Center - Monitoring
● Data center of Océ Printing Systems, Poing
● Local IT and most of the company-wide IT
● Network (local, corporate and Europe)
● Our numbers
● 1800 hosts
● 50% of them running MS Windows
● More than 160 network components (switches, routers, firewalls)
● 16900 services
● About 50% of them on MS Windows
● The rest is mainly Unix/Linux, SAN, NetApp filers and network
What drives someone to a rewrite?
● check_vmware_api.pl (formerly check_esx3.pl)
● is well established
● is widely used
● surely does what it is supposed to do
● 4650 lines of perfectly maintained code
● Open to changes
What drives someone to a rewrite?
./check_vmware_api.pl -H 192.168.51.3 -u nagios -p MeinPasswort -l runtime -s health
CHECK_VMWARE_API.PL OK - 178 health issue(s) found in 178 checks:
1) UNKNOWN[System] Status of System Management Software 0 NMI 0: NMI/Diag Interrupt unknown: Cannot report on the current health state of the element
2) UNKNOWN[System] Status of System Management Software 0 NMI 0: Software NMI unknown: Cannot report on the current health state of the element
3) UNKNOWN[System] Status of System Management Software 0 NMI 0: Fatal NMI - unknown: Cannot report on the current health state of the element
4) UNKNOWN[System] Status of System Board 0 CPU detection 0: Undetermined system hardware failure - unknown: Cannot report on the current health state of the element
5) UNKNOWN[fan] Status of Power Supply 8 FAN PSU2 --- Normal: Cannot report on the current health state of the element
6) UNKNOWN[fan] Status of Power Supply 4 FAN PSU1 --- Normal: Cannot report on the current health state of the element
7) UNKNOWN[fan] Status of Fan Device 4 FAN5 SYS --- Normal: Cannot report on the current health state of the element
8) UNKNOWN[fan] Status of Fan Device 3 FAN4 SYS --- Normal: Cannot report on the current health state of the element
9) UNKNOWN[fan] Status of Fan Device 2 FAN3 SYS --- Normal: Cannot report on the current health state of the element
10) UNKNOWN[fan] Status of Fan Device 1 FAN2 SYS --- Normal: Cannot report on the current health state of the element
….....
176) UNKNOWN[temperature] Status of System Board 0 Systemboard 2 --- Normal: Cannot report on the current health state of the element
177) UNKNOWN[temperature] Status of System Board 0 Systemboard 1 --- Normal: Cannot report on the current health state of the element
178) UNKNOWN[temperature] Status of External Environment 0 Ambient --- Normal: Cannot report on the current health state of the element | Alerts=178;;
What drives someone to a rewrite?
./check_vmware_api.pl -H 192.168.51.3 -u nagios -p MeinPasswort -l storage -s lun
CHECK_VMWARE_API.PL OK - Local FTS CORP Enclosure Svc Dev (naa.500605b0000272bd)
<ok>; Local USB CD-ROM (mpx.vmhba32:C0:T0:L0) <ok>; Local Optiarc CD-ROM
(mpx.vmhba3:C0:T0:L0) <ok>; NETAPP Fibre Channel Disk
(naa.60a9800064666b72684a72434a2d3670) <ok>; Local LSI Disk (naa.6003005700ec0d4016
37f5e50be9336f) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a7333353579
52) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c3465704841364f) <ok>; NET
APP Fibre Channel Disk (naa.60a98000646648654c3465704841514e) <ok>; NETAPP Fibre Chan
nel Disk (naa.60a98000646648654c34657048395643) <ok>; NETAPP Fibre Channel Disk (naa.
60a9800064666b72684a6570454b6a5a) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648
654c346d3548496370) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a664f6a61
4e77) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c346d354744434e) <ok>; NE
TAPP Fibre Channel Disk (naa.60a9800064666b72684a733349796749) <ok>; NETAPP Fibre Cha
nnel Disk (naa.60a9800064666b72684a676767795530) <ok>; NETAPP Fibre Channel Disk (naa
.60a98000646648654c34714e4a703049) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666
b72684a736146337367) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a7249677
65442) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a724961426b37) <ok>; N
ETAPP Fibre Channel Disk (naa.60a9800064666b72684a7252357a4e6c) <ok>; NETAPP Fibre Ch
annel Disk (naa.60a98000646648654c346570482d6b6d) <ok>; NETAPP Fibre Channel Disk (na
a.60a9800064666b72684a6570454d6663) <ok>; NETAPP Fibre Channel Disk (naa.60a980006466
48654c34704439683335) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c34715568
385779) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6570454c5661) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34657048416961) <ok>; NETAPP Fibre C
hannel Disk (naa.60a9800064666b72684a715568707049) <ok>; NETAPP Fibre Channel Disk (n
aa.60a9800064666b72684a657045486947) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646
66b72684a6570454a4350) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a72496
1326143) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c346c2f61565253) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a657045497277) <ok>; NETAPP Fibre
Channel Disk (naa.60a9800064666b72684a7475532f5546) <ok>; NETAPP Fibre Channel Disk
(naa.60a98000646648654c34676767773437) <ok>; NETAPP Fibre Channel Disk (naa.60a980006
4666b72684a717659744676) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a736146
705a53) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a703952666377) <ok>; NET
APP Fibre Channel Disk (naa.60a9800064666b72684a676767796c6f) <ok>; NETAPP Fibre Channel
Disk (naa.60a98000646648654c346c2f61535956) <ok>; NETAPP Fibre Channel Disk (naa.60a9800
064666b72684a6f466762434d) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c346c2f
6159664c) <ok>; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a657045505178) <ok>; N
ETAPP Fibre Channel Disk (naa.60a98000646648654c346570482d3177) <ok>; NETAPP Fibre Chann
el Disk (naa.60a9800064666b72684a726430493364) <ok>; NETAPP Fibre Channel Disk (naa.60a9
8000646648654c34657048423277) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c346
570482d5151) <ok>; NETAPP Fibre Channel Disk (naa.60a98000646648654c34664f6a4f4c72) <ok>
; NETAPP Fibre Channel Disk (naa.60a9800064666b72684a723230647362) <ok>; NETAPP Fibre Ch
annel Disk (naa.60a9800064666b72684a6570454d3469) <ok>; NETAPP Fibre Channel Disk (naa.6
0a9800064666b72684a724968336855) <ok>; | LUNs=51units;;
What drives someone to a rewrite?
● The help and the options:
● Opaque, faulty and nonsensical (e.g.)
-l, --command=COMMAND
+ usagemhz - CPU usage in MHz
o breif - list only alerting volumes
o quickstats - switch for query either PerfCounter values or Runtime info
T (value) - timeshift to detemine if we need to refresh
-i, --interval=<sampling period> and -M, --maxsamples=<max sample count>
● Partly misleading/wrong
● Proposed patches/changes are not merged or take forever.
What drives someone to a rewrite?
● Bugs in the code
● Misinterpretation of performance counters
● The (code) quality
● Deeply nested (elsif)
● Different solutions for the same task (e.g. while/until)
● Many redundancies (e.g. sensor handling 4 times in host_runtime_info())
● Not modular (hard to maintain)
● Nagios::Plugin is used, but not consistently
● Nonsensical performance data
● Lack of documentation (e.g. comments)
Processing of historical data
● On timeshift, sample interval etc.:
● The PerformanceManager object manages performance
statistics collected from various components, such as a
host, virtual machine, clusters and resource pools. The
collection of performance statistics is associated with but not
limited to managed entities defined for the object model. Those
managed entities that are capable of returning performance
statistics are Performance Providers. The capabilities of
performance providers can be retrieved using
(QueryPerfProviderSummary). In PerformanceManager, three
sets of methods are used to perform the following:
● Create, remove, or update intervals for historical statistics.
● Query performance statistics.
● Query metadata information about performance statistics
counters.
...and the quickstats
● Data Object – VirtualMachineQuickStats
A set of statistics that are typically updated with near realtime regularity. This data object type does not support
notification, for scalability reasons. Therefore, changes in
QuickStats do not generate property collector updates. To
monitor statistics values, use the statistics and alarms
modules instead.
https://www.vmware.com/support/developer/vcsdk/visdk2xpubs/ReferenceGuide/vim.vm.Summary.QuickStats.html
...and even more quickstats
● Data Object – HostListSummaryQuickStats
Included in the host statistics are fairness scores. Fairness
scores are represented in units with relative values, meaning
they are evaluated relative to the scores of other hosts. They
should not be thought of as having any particular
absolute value. Each fairness unit represents an increment
of 0.001 in a fairness score. The further the fairness score
diverges from 1, the less fair the allocation. Therefore, a
fairness score of 990, representing 0.990, is more fair than a
fairness score of 1015, which represents 1.015. This is
because 1.015 is further from 1 than 0.990.
https://www.vmware.com/support/developer/vcsdk/visdk2xpubs/ReferenceGuide/vim.host.Summary.QuickStats.html
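The arithmetic in the quoted passage can be checked with a small sketch. The function names are made up for illustration and belong neither to the VMware SDK nor to the plugin; only the scaling (one unit = 0.001) and the "distance from 1" rule come from the quoted documentation.

```python
def fairness_score(units):
    """Convert fairness units to a score; one unit is an increment of 0.001."""
    return units / 1000

def deviation_units(units):
    """Distance from a score of 1 (= 1000 units); larger means less fair."""
    return abs(units - 1000)

# The two values used in the quoted documentation
for units in (990, 1015):
    print(units, fairness_score(units), deviation_units(units))
```

With these definitions, 990 units deviate by 10 units and 1015 units by 15 units, which is why the documentation calls 990 the fairer score.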
What does the thing actually do?
1) All performance data in op5 (does that make sense with Nagios?)
2) Command-line replacement
● no VMware Linux client anymore
● CLI/Remote CLI is pretty hardcore
3) Monitoring
● Problem:
● Documentation and help were not maintained in step with development
● Quite a few options are fine for 1) or 2) but not usable for 3).
Goal: transparent - maintainable - extensible
● Reduction to what is essential for monitoring
● Removal of unnecessary features and options
● Modularization of the code
● Reformatting of the entire code
● clearer code structure
● Uniform code style/untangling of the code
● Uniform constructs
● Insertion of comments
● A relatively detailed change history
What has been done.....
● Removed unnecessary performance data (e.g. number of controllers, number of running VMs, number of network cards.....)
● Renamed PATH to MPATH and changed the help from "path - list logical unit paths" to "mpath - list logical unit multipath info" because it is NOT information about a path; it is information about multipathing.
● Removed the installation information for the Perl SDK from VMware.
● Replaced global variables with my variables.
● Variable definitions always at the beginning of functions
● Replaced all die with a normal if statement and an exit.
● unless -> if and until -> while
...and done.....
● Removed Nagios::Plugin
● Replaced literals such as CRITICAL with numeric values
● Removed return_cluster_DRS_recommendations(). Pointless for alerting
● Main selection -> subroutine main_select (no more elsif "marathon")
● Stripped down vm_cpu_info. Monitoring CPU usage in MHz makes no sense under normal circumstances
● Removed $value1 - $valuen
● Removed swap in vm_mem_info(). From the VMware documentation: "Current amount of guest physical memory swapped out to the virtual machine's swap file by the VMkernel. Swapped memory stays on disk until the virtual machine needs it. This statistic refers to VMkernel swapping and not to guest OS swapping. swapped = swapin + swapout". This is more an issue of performance tuning than of alerting. It is not swapping inside the virtual machine.
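One common way to get rid of a long elsif chain is a dispatch table. The slides do not say how main_select is implemented, so the following is only a sketch of the general technique. The select values "cpu" and "mem" and the handler bodies are assumptions for illustration, and the real plugin is written in Perl, not Python.

```python
# Hypothetical handlers; the plugin has Perl subroutines with these names,
# but their real signatures and return values are not shown on the slides.
def vm_cpu_info(subselect):
    return 0, f"cpu check, subselect={subselect}"

def vm_mem_info(subselect):
    return 0, f"mem check, subselect={subselect}"

# Dispatch table: one entry per select value instead of one elsif branch each.
HANDLERS = {
    "cpu": vm_cpu_info,
    "mem": vm_mem_info,
}

def main_select(select, subselect=None):
    """Look up the handler for a select value and call it."""
    handler = HANDLERS.get(select)
    if handler is None:
        # 3 is the Nagios return code for UNKNOWN
        return 3, f"Unknown select: {select}"
    return handler(subselect)

print(main_select("cpu", "ready"))
print(main_select("bogus"))
```

Adding a new check then means adding one handler and one table entry, without touching the selection logic.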
...and done.....
● Removed OVERHEAD in vm_mem_info(). From the VMware documentation: "Amount of machine memory used by the VMkernel to run the virtual machine." Used on its own, this is useless information about a virtual machine because we have no valid context and no valid thresholds. Overhead matters more for the host system. If we run into problems there, we have to look at which machine must be moved to another host. As a result, the overall value in vm_mem_info() makes no sense either.
● Removed memctl (ballooning) in vm_mem_info(). From the VMware documentation: "Amount of guest physical memory that is currently reclaimed from the virtual machine through ballooning. This is the amount of guest physical memory that has been allocated and pinned by the balloon driver." So here again we have data that makes no sense when used alone. You need the context to interpret it, and there are no thresholds for alerting.
...and done.....
● Reimplemented subselect ready in vm_cpu_info and newly implemented it in host_cpu_info. From the VMware documentation: "Percentage of time that the virtual machine was ready, but could not get scheduled to run on the physical CPU. CPU ready time is dependent on the number of virtual machines on the host and their CPU loads." High or growing ready time can hint at CPU bottlenecks (host and guest system).
● Reimplemented subselect wait in vm_cpu_info and newly implemented it in host_cpu_info. From the VMware documentation: "CPU time spent in wait state. The wait total includes time spent the CPU Idle, CPU Swap Wait, and CPU I/O Wait states." High or growing wait time can hint at I/O bottlenecks (host and guest system).
...and done.....
● Removed subroutines return_dc_performance_values, dc_cpu_info, dc_mem_info, dc_net_info and dc_disk_io_info. The monitored entity was of view type HostSystem. This means that the CPU of the data center server was monitored.
● Replaced $command and $subcommand with $select and $subselect. Therefore the options -l command and -s subcommand changed as well.
● Kicked out all (I hope so) code for processing historical data from generic_performance_values(). generic_performance_values() is called by return_host_performance_values(), return_host_vmware_performance_values() and return_cluster_performance_values() (return_cluster_performance_values() must be rewritten now).
● Doing this reduced the code length of generic_performance_values() to one third.
...and done.....
● Changed the select option for datastores from vmfs to volumes because we will have volumes on NFS AND VMFS.
● Added the volume type to datastore_volumes_info(), so you can see whether the volume is VMFS (local or SAN) or NFS.
● Rewrote and cleaned up subroutine host_disk_io_info().
● Changed the output. Unlike vm_disk_io_info(), most values in host_disk_io_info() are not transfer rates but latencies in milliseconds. The output is now clearly understandable.
● Added subselect read. Average number of kilobytes read from the disk each second. Rate at which data is read from each LUN on the host. read rate = # blocksRead per second x blockSize.
● Added subselect write. Average number of kilobytes written to disk each second. Rate at which data is written to each LUN on the host. write rate = # blocksWritten per second x blockSize.
● Added subselect usage. Aggregated disk I/O rate. For hosts, this metric includes the rates for all virtual machines running on the host.
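The rate formula quoted above (blocks per second x block size) can be worked through with concrete numbers. The function name, the 512-byte block size and the 2000 blocks per second are assumptions chosen for the example; the plugin reads the finished rate from a VMware performance counter and does not compute it this way.

```python
def rate_kb_per_s(blocks_per_second, block_size_bytes):
    """Transfer rate in kilobytes per second: blocks/s x block size, in KB."""
    return blocks_per_second * block_size_bytes / 1024

# Assumed example values: 2000 blocks per second, 512-byte blocks
print(rate_kb_per_s(2000, 512))
```

2000 x 512 bytes is 1,024,000 bytes per second, which is 1000 KB per second.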
...and done.....
● Changed "eval { require VMware::VIRuntime };" to "use VMware::VIRuntime;". The eval construct made no sense. If the module isn't available, the program will crash with a compile error.
● Moved host_device_info to host_mounted_media_info. Contrary to its name and description, this function wasn't designed to list all devices on a host. It was designed to show host CDs/DVDs mounted to one or more virtual machines. This is important for monitoring because a virtual machine with a mounted CD or DVD drive cannot be moved to another host. Added a check for host floppy drives.
● Added SOAP check from Simon Meggle, Consol. Slightly modified to fit.
● Added isblacklisted and isnotwhitelisted from Simon Meggle, Consol. Enhanced host_mounted_media_info.pm
...and done.....
● host_runtime_info().
● Filtered out the sensor type "software components".
● Kicked out maintenance info in the runtime summary and as a subselect. At the beginning of the function there is a check for maintenance. In this case the original program exited with a die, which caused a red alert in Nagios. Now an info message is displayed and a return code of 1 (warning) is delivered, because maintenance is regular work but there should be a notice.
● listvms
● Connection info according to the manual.
● In case of no VMs the plugin returned a critical. But this is not correct. No VMs on a host is not an error. It is simply what it says: No VMs.
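The two behavior changes above can be sketched with the standard Nagios return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and message texts are hypothetical; only the chosen return codes for maintenance mode and for an empty VM list come from the slide.

```python
# Standard Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def maintenance_check(in_maintenance):
    """Maintenance is planned work: report it, but only as a warning."""
    if in_maintenance:
        return WARNING, "Host is in maintenance mode"
    return OK, "Host is not in maintenance mode"

def listvms_check(vm_names):
    """An empty VM list is a valid state, not an error."""
    if not vm_names:
        return OK, "No VMs"
    # Placeholder for the non-empty case, which the slides do not describe
    return OK, f"{len(vm_names)} VM(s) found"

print(maintenance_check(True))
print(listvms_check([]))
```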
Usability
● revised help
● new options (--multiline or --alertonly)
● revised options
● --option instead of -o option
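The slides only name the two new options and do not define them. The sketch below assumes that --alertonly suppresses all entries in state OK and that --multiline puts each entry on its own line; both meanings, the function name and the sample LUN names are assumptions, not the plugin's documented behavior.

```python
STATE_NAMES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}

def format_output(items, alertonly=False, multiline=False):
    """Format (name, state) pairs; optionally drop OK entries and split lines."""
    if alertonly:
        items = [(name, state) for name, state in items if state != 0]
    separator = "\n" if multiline else "; "
    return separator.join(f"{name} <{STATE_NAMES[state]}>" for name, state in items)

# Made-up sample data
luns = [
    ("Local LSI Disk", 0),
    ("NETAPP Fibre Channel Disk A", 0),
    ("NETAPP Fibre Channel Disk B", 2),
]
print(format_output(luns, alertonly=True))
```

Under these assumptions, a listing like the 51-LUN output shown earlier would shrink to the entries that actually need attention.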
What's next?
● Finish the rework
● implement new features later
● And to help out:
https://github.com/BaldMansMojo/check_vmware_esx
