Rewrite the VMware monitoring
check_vmware_esx.pl – a rewrite
Martin Fürstenau
Oce Printing Systems GmbH & Co. KG
[email protected]

About me
● Senior System Engineer at Oce Printing Systems GmbH & Co. KG in Poing near Munich
● 27 years in IT, 24 years of Unix, 19 years of Linux, 9 years at Oce
● Current focus
  ● Administration of about 120 Linux machines (CentOS and Red Hat)
  ● Monitoring (operation, development of plugins and add-ons, further development of the platform)

Plugins, extensions and me
● Development of plugins and extensions for
  ● Brocade FC switches
  ● Network interfaces for Solaris, Linux, Windows and NetApp
  ● Network card failover (Windows, Linux and Solaris)
  ● Processing of alarms that arrive by email
  ● Coupling to the helpdesk system (Topdesk)
  ● Cisco switches and WLCs
  ● CPU usage, memory (Linux, Windows, Solaris, VMware, McAfee Web Gateway)
  ● RAID controllers
  ● URLs (also via a proxy with authentication)
  ● LUNs under Windows (NetApp)
  ● VMware monitoring

Oce European Data Center - Monitoring
● Data center of Océ Printing Systems, Poing
  ● Local IT and most of the company-wide IT
  ● Network (local, corporate and Europe)
● Our numbers
  ● 1800 hosts
    ● 50% of them running MS Windows
    ● More than 160 network components (switches, routers, firewalls)
  ● 16900 services
    ● About 50% of them on MS Windows
    ● The rest is mainly Unix/Linux, SAN, NetApp filers and network

What drives a rewrite?
● check_vmware_api.pl (formerly check_esx3.pl)
  ● is established
  ● is widely used
  ● does what it is supposed to do, after all
  ● 4650 lines of well-maintained code
  ● open to change

What drives a rewrite?
./check_vmware_api.pl -H 192.168.51.3 -u nagios -p MeinPasswort -l runtime -s health
CHECK_VMWARE_API.PL OK - 178 health issue(s) found in 178 checks:
1) UNKNOWN[System] Status of System Management Software 0 NMI 0: NMI/Diag Interrupt unknown: Cannot report on the current health state of the element
2) UNKNOWN[System] Status of System Management Software 0 NMI 0: Software NMI unknown: Cannot report on the current health state of the element
3) UNKNOWN[System] Status of System Management Software 0 NMI 0: Fatal NMI - unknown: Cannot report on the current health state of the element
4) UNKNOWN[System] Status of System Board 0 CPU detection 0: Undetermined system hardware failure - unknown: Cannot report on the current health state of the element
5) UNKNOWN[fan] Status of Power Supply 8 FAN PSU2 --- Normal: Cannot report on the current health state of the element
6) UNKNOWN[fan] Status of Power Supply 4 FAN PSU1 --- Normal: Cannot report on the current health state of the element
7) UNKNOWN[fan] Status of Fan Device 4 FAN5 SYS --- Normal: Cannot report on the current health state of the element
8) UNKNOWN[fan] Status of Fan Device 3 FAN4 SYS --- Normal: Cannot report on the current health state of the element
9) UNKNOWN[fan] Status of Fan Device 2 FAN3 SYS --- Normal: Cannot report on the current health state of the element
10) UNKNOWN[fan] Status of Fan Device 1 FAN2 SYS --- Normal: Cannot report on the current health state of the element
….....
176) UNKNOWN[temperature] Status of System Board 0 Systemboard 2 --- Normal: Cannot report on the current health state of the element
177) UNKNOWN[temperature] Status of System Board 0 Systemboard 1 --- Normal: Cannot report on the current health state of the element
178) UNKNOWN[temperature] Status of External Environment 0 Ambient --- Normal: Cannot report on the current health state of the element | Alerts=178;;

What drives a rewrite?
./check_vmware_api.pl -H 192.168.51.3 -u nagios -p MeinPasswort -l storage -s lun
CHECK_VMWARE_API.PL OK -
Local FTS CORP Enclosure Svc Dev (naa.500605b0000272bd) <ok>;
Local USB CD-ROM (mpx.vmhba32:C0:T0:L0) <ok>;
Local Optiarc CD-ROM (mpx.vmhba3:C0:T0:L0) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a72434a2d3670) <ok>;
Local LSI Disk (naa.6003005700ec0d401637f5e50be9336f) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a733335357952) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c3465704841364f) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c3465704841514e) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34657048395643) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6570454b6a5a) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346d3548496370) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a664f6a614e77) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346d354744434e) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a733349796749) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a676767795530) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34714e4a703049) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a736146337367) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a724967765442) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a724961426b37) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a7252357a4e6c) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346570482d6b6d) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6570454d6663) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34704439683335) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34715568385779) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6570454c5661) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34657048416961) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a715568707049) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a657045486947) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6570454a4350) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a724961326143) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346c2f61565253) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a657045497277) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a7475532f5546) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34676767773437) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a717659744676) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a736146705a53) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a703952666377) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a676767796c6f) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346c2f61535956) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6f466762434d) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346c2f6159664c) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a657045505178) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346570482d3177) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a726430493364) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34657048423277) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c346570482d5151) <ok>;
NETAPP Fibre Channel Disk (naa.60a98000646648654c34664f6a4f4c72) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a723230647362) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a6570454d3469) <ok>;
NETAPP Fibre Channel Disk (naa.60a9800064666b72684a724968336855) <ok>; | LUNs=51units;;

What drives a rewrite?
● The help text and the options:
  ● Opaque, erroneous and nonsensical (for example)
    -l, --command=COMMAND
    + usagemhz - CPU usage in MHz
    o breif - list only alerting volumes
    o quickstats - switch for query either PerfCounter values or Runtime info
    T (value) - timeshift to detemine if we need to refresh
    -i, --interval=<sampling period> and -M, --maxsamples=<max sample count>
  ● Partly misleading or wrong
● Proposed patches and changes are not merged, or take forever.

What drives a rewrite?
● Bugs in the code
  ● Misinterpretation of performance counters
● The (code) quality
  ● Deeply nested (elsif).
  ● Different solutions for the same task (e.g. while/until).
  ● Many redundancies (e.g. the sensor handling appears 4 times in host_runtime_info()).
  ● Not modular (hard to maintain).
  ● Nagios::Plugin is used, but not consistently.
  ● Nonsensical performance data.
  ● Lack of documentation (e.g. comments).

Processing of historical data
● On the subject of timeshift, sample interval etc.:
● The PerformanceManager object manages performance statistics collected from various components, such as a host, virtual machine, clusters and resource pools. The collection of performance statistics is associated with but not limited to managed entities defined for the object model. Those managed entities that are capable of returning performance statistics are Performance Providers. The capabilities of performance providers can be retrieved using (QueryPerfProviderSummary). In PerformanceManager, three sets of methods are used to perform the following:
  ● Create, remove, or update intervals for historical statistics.
  ● Query performance statistics.
  ● Query metadata information about performance statistics counters.

...and the quickstats
● Data Object – VirtualMachineQuickStats
A set of statistics that are typically updated with near realtime regularity. This data object type does not support notification, for scalability reasons. Therefore, changes in QuickStats do not generate property collector updates.
To monitor statistics values, use the statistics and alarms modules instead.
https://www.vmware.com/support/developer/vcsdk/visdk2xpubs/ReferenceGuide/vim.vm.Summary.QuickStats.html

...and even more quickstats
● Data Object – HostListSummaryQuickStats
Included in the host statistics are fairness scores. Fairness scores are represented in units with relative values, meaning they are evaluated relative to the scores of other hosts. They should not be thought of as having any particular absolute value. Each fairness unit represents an increment of 0.001 in a fairness score. The further the fairness score diverges from 1, the less fair the allocation. Therefore, a fairness score of 990, representing 0.990, is more fair than a fairness score of 1015, which represents 1.015. This is because 1.015 is further from 1 than 0.990.
https://www.vmware.com/support/developer/vcsdk/visdk2xpubs/ReferenceGuide/vim.host.Summary.QuickStats.html

What does the thing actually do?
● 1) All performance data in op5 (does that make sense with Nagios?)
● 2) Command-line replacement
  ● there is no VMware Linux client any more
  ● CLI/Remote CLI is fairly hardcore
● 3) Monitoring
● Problem:
  ● Documentation and help were not maintained in parallel with development
  ● Quite a few options are fine for 1) or 2) but not usable for 3).

Goal: transparent - maintainable - extensible
● Reduction to what is essential for monitoring
  ● Remove unnecessary features and options
● Modularization of the code
● Reformatting of the entire code
  ● clearer code structure
  ● uniform code appearance / untangling of the code
  ● uniform constructs
● Adding comments
● A fairly detailed change history

What has been done.....
● Removed unnecessary performance data (e.g. number of controllers, number of running VMs, number of network cards ...).
● Changed PATH to MPATH and the help text from "path - list logical unit paths" to "mpath - list logical unit multipath info", because it is NOT information about a path; it is information about multipathing.
● Removed the installation information for the VMware Perl SDK.
● Replaced global variables with my variables.
● Variable definitions are always placed at the beginning of functions.
● Replaced every die with a normal if statement and an exit.
● unless -> if and until -> while

...and done.....
● Removed Nagios::Plugin.
● Replaced literals such as CRITICAL with numeric values.
● Removed return_cluster_DRS_recommendations(). It makes no sense for alerting.
● Main selection -> subroutine main_select (the elsif "marathon" is gone).
● Stripped down vm_cpu_info. Monitoring CPU usage in MHz makes no sense under normal circumstances.
● Removed $value1 - $valuen.
● Removed swap in vm_mem_info(). From the VMware documentation: "Current amount of guest physical memory swapped out to the virtual machine's swap file by the VMkernel. Swapped memory stays on disk until the virtual machine needs it. This statistic refers to VMkernel swapping and not to guest OS swapping. swapped = swapin + swapout". This is more an issue of performance tuning than of alerting. It is not swapping inside the virtual machine.

...and done.....
● Removed OVERHEAD in vm_mem_info(). VMware documentation: "Amount of machine memory used by the VMkernel to run the virtual machine." Used this way it is useless information about a virtual machine, because we have no valid context and no valid thresholds. Overhead matters more for the host system, and if we run into problems there we have to look at which machine must be moved to another host. As a result, overhead in vm_mem_info() makes no sense.
● Removed swap (presumably memctl, the balloon value, given the quote) in vm_mem_info(). VMware documentation: "Amount of guest physical memory that is currently reclaimed from the virtual machine through ballooning. This is the amount of guest physical memory that has been allocated and pinned by the balloon driver." So here again we have data that makes no sense on its own. You need the context to interpret it, and there are no thresholds for alerting.
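
The move from an elsif chain to a single selection subroutine that returns numeric Nagios codes could look roughly like the sketch below. This is not the plugin's actual code: the dispatch table is one possible way to avoid the elsif chain, and the handler names check_cpu and check_runtime as well as the messages are made up for illustration. Only the name main_select and the idea of plain numeric return codes come from the slides.

```perl
use strict;
use warnings;

# Nagios return codes as plain numbers: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

# Hypothetical handlers; each returns (return code, message).
sub check_cpu {
    my ($subselect) = @_;
    my $what = defined $subselect ? $subselect : 'usage';
    return (0, "OK - cpu $what within limits");
}

sub check_runtime {
    my ($subselect) = @_;
    return (1, "WARNING - host is in maintenance mode");
}

# One table entry per select value instead of one elsif branch per value.
my %dispatch = (
    cpu     => \&check_cpu,
    runtime => \&check_runtime,
);

sub main_select {
    my ($select, $subselect) = @_;
    my $handler;

    # A plain if with a numeric code instead of die.
    if (!defined $select || !exists $dispatch{$select}) {
        return (3, "UNKNOWN - unsupported select '" . (defined $select ? $select : '') . "'");
    }
    $handler = $dispatch{$select};
    return $handler->($subselect);
}

my ($code, $message) = main_select('cpu', 'ready');
print "$message (return code $code)\n";
# A real plugin would finish with: exit $code;
```

Adding a new select value then means adding one table entry and one subroutine, without touching the selection logic.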
...and done.....
● Reimplemented subselect ready in vm_cpu_info and newly implemented it in host_cpu_info. From the VMware documentation: "Percentage of time that the virtual machine was ready, but could not get scheduled to run on the physical CPU. CPU ready time is dependent on the number of virtual machines on the host and their CPU loads." High or growing ready time can hint at CPU bottlenecks (host and guest system).
● Reimplemented subselect wait in vm_cpu_info and newly implemented it in host_cpu_info. From the VMware documentation: "CPU time spent in wait state. The wait total includes time spent the CPU Idle, CPU Swap Wait, and CPU I/O Wait states." High or growing wait time can hint at I/O bottlenecks (host and guest system).

...and done.....
● Removed the subroutines return_dc_performance_values, dc_cpu_info, dc_mem_info, dc_net_info and dc_disk_io_info. The monitored entity was of view type HostSystem. This means that the CPU of the data center server is monitored.
● Replaced $command and $subcommand with $select and $subselect. Therefore the options -l command and -s subcommand changed as well.
● Kicked out all (I hope) code for processing historical data from generic_performance_values(). generic_performance_values() is called by return_host_performance_values(), return_host_vmware_performance_values() and return_cluster_performance_values() (return_cluster_performance_values() must be rewritten now).
● This reduced the code length of generic_performance_values() to one third.

...and done.....
● Changed the select option for datastores from vmfs to volumes, because we will have volumes on NFS AND VMFS.
● Added the volume type to datastore_volumes_info(), so you can see whether the volume is VMFS (local or SAN) or NFS.
● Rewrote and cleaned up the subroutine host_disk_io_info().
● Changed the output. In contrast to vm_disk_io_info(), most values in host_disk_io_info() are not transfer rates but latencies in milliseconds. The output is now clearly understandable.
● Added subselect read. Average number of kilobytes read from the disk each second. Rate at which data is read from each LUN on the host. read rate = # blocksRead per second x blockSize.
● Added subselect write. Average number of kilobytes written to disk each second. Rate at which data is written to each LUN on the host. write rate = # blocksWritten per second x blockSize.
● Added subselect usage. Aggregated disk I/O rate. For hosts, this metric includes the rates for all virtual machines running on the host.

...and done.....
● Changed "eval { require VMware::VIRuntime };" to "use VMware::VIRuntime;". The eval construct made no sense. If the module isn't available, the program will crash with a compile error.
● Moved host_device_info to host_mounted_media_info. Contrary to its name and description, this function wasn't designed to list all devices on a host. It was designed to show host CDs/DVDs mounted to one or more virtual machines. This is important for monitoring because a virtual machine with a mounted CD or DVD drive cannot be moved to another host. Added a check for host floppy drives.
● Added the SOAP check from Simon Meggle, Consol. Slightly modified to fit.
● Added isblacklisted and isnotwhitelisted from Simon Meggle, Consol. Enhanced host_mounted_media_info.pm.

...and done.....
● host_runtime_info()
  ● Filtered out the sensor type "software components".
  ● Kicked out the maintenance info in the runtime summary and as a subselect. At the beginning of the function there is a check for maintenance mode. In the original program this case ended with a die, which caused a red alert in Nagios. Now an info message is displayed and a return code of 1 (warning) is delivered, because maintenance is regular work but there should still be a notice.
● listvms
  ● Connection info according to the manual.
  ● If there were no VMs, the plugin returned a critical. But this is not correct. No VMs on a host is not an error. It is simply what it says: no VMs.
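
The two behaviour changes described for host_runtime_info() and listvms can be sketched as follows. This is only an illustration under assumptions: the subroutine name host_runtime_status and the plain hash with the keys name, in_maintenance_mode and vms are invented here and stand in for the data the VMware Perl SDK would deliver.

```perl
use strict;
use warnings;

# Hypothetical helper. $host is a plain hash reference, not an SDK object.
# Returns (Nagios return code, message).
sub host_runtime_status {
    my ($host) = @_;
    my $vm_count;

    # Maintenance is regular work: report it as WARNING (1) instead of dying.
    if ($host->{in_maintenance_mode}) {
        return (1, "$host->{name} is in maintenance mode");
    }

    # A host without VMs is not an error, so this stays OK (0).
    $vm_count = scalar @{ $host->{vms} || [] };
    if ($vm_count == 0) {
        return (0, "No VMs on $host->{name}");
    }
    return (0, "$vm_count VM(s) on $host->{name}");
}

my ($code, $message) = host_runtime_status(
    { name => 'esx01', in_maintenance_mode => 0, vms => [] }
);
print "$message (return code $code)\n";
```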
Usability
● Revised help text
● New options (--multiline or --alertonly)
● Revised options
● --option instead of -o option

What happens next?
● Finish the rework
● Implement new features later
● And if you want to help: https://github.com/BaldMansMojo/check_vmware_esx
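
Long options of the kind listed under Usability can be parsed with the core module Getopt::Long. The sketch below is an assumption about how such parsing might look, not the plugin's real option handling: --multiline, --alertonly, --select and --subselect are named on the slides, while --host and the subroutine name parse_options are chosen here for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parses a list of arguments and returns a hash
# reference with the option values, or undef if parsing failed.
sub parse_options {
    my (@args) = @_;
    my %opt = (multiline => 0, alertonly => 0);
    my $ok;

    $ok = GetOptionsFromArray(
        \@args,
        'host=s'      => \$opt{host},        # assumed option name
        'select=s'    => \$opt{select},
        'subselect=s' => \$opt{subselect},
        'multiline'   => \$opt{multiline},   # flag, no value
        'alertonly'   => \$opt{alertonly},   # flag, no value
    );
    return $ok ? \%opt : undef;
}

my $opt = parse_options('--host', 'esx01', '--select', 'runtime', '--multiline');
print "select=$opt->{select} multiline=$opt->{multiline} alertonly=$opt->{alertonly}\n";
```

Passing the argument list explicitly instead of reading @ARGV keeps the parser easy to test in isolation.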