
Quis custodiet ipsos custodes? Nagios monitoring for Grid Control

I have a strange problem with my Grid Control 11.10.1.2 Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads. The same information is also recorded in the logs.

NB: the really confusing part about Grid Control 11.1 is the use of WebLogic. You thought you knew where the Grid Control logs were? Forget what you knew about 10.2 and enter a different dimension :)
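For what it's worth, the stuck-thread messages turned up in the managed server's output file rather than in the old 10.2-style locations. The paths below are assumptions based on a default 11.1 install, so adjust the instance home (gc_inst) and domain name to match your environment:

# Assumption: default 11.1 layout; substitute your own gc_inst location
# WebLogic managed server output for EMGC_OMS1, including the stuck thread messages:
tail -f /u01/app/oracle/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs/EMGC_OMS1.out

# The OMS application logs (emoms.trc and friends) moved into the instance home too:
ls /u01/app/oracle/gc_inst/em/EMGC_OMS1/sysman/log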

So, to be able to react more quickly to a hang of the OMS (or EMGC_OMS1 to be more precise), I set up nagios to periodically poll the login page.

I’m using a VM with OEL 5.5 64bit to deploy nagios to; the requirements are very moderate. The install process is well documented in the quickstart guide (I’m using the Fedora instructions as a basis). OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to build from the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of this writing. If you haven’t got a development environment, build it:

  • # yum install httpd
  • # yum install php
  • # yum install gcc glibc glibc-common
  • # yum install gd gd-devel
  • # yum install openssl-devel

From then on it’s as simple as copy-and-pasting from the quickstart guide. The only problem I had with the check_http plugin was the lack of openssl-devel: I initially built the plugins without the "--with-openssl=/usr/include/openssl" flag. After executing the configure command again the build still didn’t work for check_http (undefined symbol errors), but that could be fixed with a "make clean; make".

For the remainder of this article I assume you built nagios with these arguments to configure:

./configure --with-command-group=nagcmd

The plugins have been built with these options:

./configure --with-openssl=/usr/include/openssl --with-nagios-user=nagios --with-nagios-group=nagios

This will install nagios to /usr/local/nagios, which is fine by me; you’d obviously choose a different prefix when configuring a production nagios server. Start the nagios server as per the quickstart guide using "service nagios start".
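For completeness, the build sequence I followed is essentially the one from the quickstart guide. The sketch below assumes the tarballs were unpacked in /tmp and an Apache user named apache; treat it as an outline rather than a verbatim transcript:

# users and groups as per the quickstart guide
/usr/sbin/groupadd nagcmd
/usr/sbin/useradd -m nagios
/usr/sbin/usermod -a -G nagcmd nagios
/usr/sbin/usermod -a -G nagcmd apache

# nagios core
cd /tmp && tar xzf nagios-3.2.3.tar.gz && cd nagios-3.2.3
./configure --with-command-group=nagcmd
make all
make install                 # binaries, CGIs and HTML under /usr/local/nagios
make install-init            # init script, enables "service nagios start"
make install-config          # sample configuration files in /usr/local/nagios/etc
make install-commandmode     # permissions on the external command file
make install-webconf         # Apache configuration for the web interface

# plugins (remember the openssl-devel prerequisite for check_http)
cd /tmp && tar xzf nagios-plugins-1.4.15.tar.gz && cd nagios-plugins-1.4.15
./configure --with-openssl=/usr/include/openssl --with-nagios-user=nagios --with-nagios-group=nagios
make && make install

service httpd start
service nagios start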

With nagios up and running you can connect to the dashboard: http://nagios.example.com/nagios

You authenticate yourself using the nagiosadmin account and the password you supplied earlier to the htpasswd command.
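If you skipped that step, it is the usual htpasswd invocation from the quickstart guide (the file name matches what "make install-webconf" configures):

# create (-c) the password file with the nagiosadmin account, then restart Apache
htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
service httpd restart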

Great! Your nagios environment is up and running. Next you need to add the OMS to the configuration. First of all I opted to change the example configuration; three steps are to be performed:

  • Modify the contacts
  • Create a check_oms command
  • Add the OMS to the list of monitored targets

Again, I should note that this setup is for monitoring two OMS hosts only, nothing else. I’m saying this because the way I add the targets is not the most elegant one. If you intend to add more targets to the nagios setup you should opt for a better approach, which is commented out in the nagios.cfg file.

Modifying contact information

I would like to be informed in case something goes wrong. Nagios offers a wealth of notification methods; I’m limiting myself to email.

The file you’d like to modify with your favourite text editor is /usr/local/nagios/etc/objects/contacts.cfg

The most basic (but sufficient) way is to edit the nagiosadmin contact. Simply change the email address to your email address and save the file. NB: you may have to configure your local MTA and add a mail relay; ask your friendly sysadmin how to do so.
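For illustration, the stock nagiosadmin definition looks roughly like this once edited; the address is of course a placeholder:

define contact{
        contact_name    nagiosadmin             ; short name of the contact
        use             generic-contact         ; inherit defaults from the template
        alias           Nagios Admin            ; full name
        email           first.last@example.com  ; <-- change this to your address
        }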

Create the check_oms command

Before we can define it as a target in nagios, we need to tell nagios how to monitor the OMS. Nagios comes with a basic set of plugins, amongst which the check_http seems the most suitable. It needs to be compiled with the openssl-devel package (see above) since the OMS logon requires the https protocol.

Open /usr/local/nagios/etc/objects/commands.cfg with your favourite text editor and add a command such as this one:

define command{
        command_name    check_oms
        command_line    $USER1$/check_http -H $HOSTALIAS$ -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon
        }

Translated back into English this means: when check_oms is used as a service check in nagios, the check_http plugin is called against the host defined by the host alias variable (we’ll define that in a minute). If the check receives an HTTP 302 (moved temporarily) I want it to return a critical error instead of OK. If the response time is greater than 5 seconds I want the service to emit a warning, and if it takes longer than 10 seconds then that’s critical. The remaining options specify that I need to use SSL against port 7799 (the default Grid Control port; change it if yours is different) and that the URL is /em/console/logon/logon. Don’t simply specify /em as the URL, as that silently redirects you to /em/console/logon/logon after an HTTP 302 message, which doesn’t help in this case. You can run the command interactively on the nagios host. The plugin is in /usr/local/nagios/libexec; the "-v" option displays the HTTP traffic:

./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v

[root@nagios libexec]# ./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v
GET /em/console/logon/logon HTTP/1.1
User-Agent: check_http/v1.4.15 (nagios-plugins 1.4.15)
Connection: close
Host: oms.example.com:7799
https://oms.example.com:7799/em/console/logon/logon is 8671 characters
STATUS: HTTP/1.1 200 OK
**** HEADER ****
Date: Mon, 28 Feb 2011 10:27:14 GMT
Server: Oracle-Application-Server-11g
Set-Cookie: JSESSIONID=tJ0yNr4CgGf4gyTPJR4kKTzL2WBg1SFLQvh0ytrpC3Kgv9xqkDsF!-2069537441; path=/em; HttpOnly
X-ORACLE-DMS-ECID: 00074S^kp_dF8Dd_Tdd9ic0000B4000DkI
X-Powered-By: Servlet/2.5 JSP/2.1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Content-Language: en
**** CONTENT ****
fe8
[content skipped...]
HTTP OK: HTTP/1.1 200 OK - 8671 bytes in 0.067 second response time |time=0.067383s;5.000000;10.000000;0.000000 size=8671B;;;0
[root@nagios libexec]#

Right: HTTP 200 and sub-second response time. I’m happy.

Create a new monitored target

This is the final step to be completed. I started off by copying the localhost.cfg file to oms.cfg and editing it. Below is a sample file with all comments and other verbose information removed:

define host{
        use               linux-server
        host_name         OMS_pilot
        alias             oms.example.com
        address           192.168.99.13
        }

define service{
        use                     generic-service
        host_name               OMS_pilot
        service_description     HTTP
        is_volatile             0
        check_period            24x7
        max_check_attempts      10
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     workhours
        check_command           check_oms
        }

I’m also using the check_ping command but that’s the default and not shown here. How does nagios know which server to execute the check against? That’s back in the command definition. Remember the -H $HOSTALIAS$ directive? Upon execution of the check, the value of the host’s alias configuration variable is passed to the check_oms command. You should therefore ensure that the nagios host can resolve that host name, and I’d recommend using the FQDN as well.
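If DNS doesn’t resolve the alias, an entry in the hosts file on the nagios server will do; the values below are simply the sample ones from the host definition above:

# /etc/hosts on the nagios server (sample values from oms.cfg)
192.168.99.13   oms.example.com   oms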

The service check will execute the check_oms command against the host every minute, 24x7. In case the service is critical, it will notify the contact group admins (which you edited in step 1) and send email during work hours (09:00-17:00 by default, defined in timeperiods.cfg).
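The workhours period comes straight from the sample timeperiods.cfg and looks roughly like this; widen it if you want out-of-hours notifications:

define timeperiod{
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }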

The final bit where everything is tied together is the nagios.cfg file: add the definition for your host as in this example:

cfg_file=/usr/local/nagios/etc/objects/oms.cfg

Alternatively, if you would like to logically group your objects, you could create /usr/local/nagios/etc/servers and put all your server configuration files in there. Regardless of which option you choose, the next step is to reload the nagios service so it picks up the new configuration.
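In nagios.cfg terms the directory approach is the cfg_dir directive, which ships commented out. Either way it pays to verify the configuration before reloading; a quick sketch:

# alternative to individual cfg_file lines in /usr/local/nagios/etc/nagios.cfg:
# cfg_dir=/usr/local/nagios/etc/servers

# sanity-check the configuration, then reload
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
service nagios reload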

(Ignore the warning; that’s an HTTP 403 issue on another host …)

Happy monitoring!

Monitoring Direct NFS with Oracle 11g and Solaris… peeling back the layers of the onion.

When I start a new project, I like to check performance from as many layers as possible. This helps to verify things are working as expected and helps me to understand how the pieces fit together. In my recent work with dNFS and Oracle 11gR2, I started down the path of monitoring performance and was surprised to see that things are not always as they seem. This post will explore the various ways to monitor and verify performance when using dNFS with Oracle 11gR2 and Sun Open Storage "Fishworks".

why is iostat lying to me?

iostat(1M) is one of the most common tools to monitor IO. Normally, I can see activity on local devices as well as NFS mounts via iostat. But with dNFS, my device seems idle in the middle of a performance run.

bash-3.0$ iostat -xcn 5
cpu
us sy wt id
8  5  0 87
extended device statistics
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0    6.2    0.0   45.2  0.0  0.0    0.0    0.4   0   0 c1t0d0
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 toromondo.west:/export/glennf
cpu
us sy wt id
7  5  0 89
extended device statistics
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0   57.9    0.0  435.8  0.0  0.0    0.0    0.5   0   3 c1t0d0
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 toromondo.west:/export/glennf

From the DB server perspective, I can’t see the IO.  I wonder what the array looks like.

what does fishworks analytics have to say about IO?

The analytics package available with Fishworks is the best way to verify performance with Sun Open Storage. This package is easy to use, and indeed I was quickly able to verify activity on the array.

There are 48,987 NFSv3 operations/sec and ~403MB/sec going through the nge13 interface.  So, this array is cooking pretty good.  So, let’s take a peek at the network on the DB host.

nicstat to the rescue

nicstat is a wonderful tool developed by Brendan Gregg at Sun to show network performance. Nicstat really shows you the critical data for monitoring network speeds and feeds by displaying packet size, utilization, and rates of the various interfaces.

root@saemrmb9> nicstat 5
Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
15:32:11    nxge0    0.11    1.51    1.60    9.00   68.25   171.7  0.00   0.00
15:32:11    nxge1  392926 13382.1 95214.4 95161.8  4225.8   144.0  33.3   0.00

So, from the DB server point of view, we are transferring about 390MB/sec… which correlates to what we saw with the analytics from Fishworks.  Cool!

why not use DTrace?

Ok, I wouldn’t be a good Sun employee if I didn’t use DTrace once in a while. I was curious to see the Oracle calls for dNFS, so I broke out my favorite tool from the DTrace Toolkit. The "hotuser" tool shows which functions are being called the most. For my purposes, I found an active Oracle shadow process and searched for NFS related functions.
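Finding a busy shadow process to attach to is just a matter of ps or prstat; the SID below is made up, so substitute your own:

# pick a busy foreground (shadow) process; on Solaris they are named oracle<SID>
prstat -s cpu 1 1 | grep oracle
ps -ef | grep oracleDB11G | grep -v grep
# then attach hotuser from the DTrace Toolkit to the chosen pid:
# hotuser -p <pid> | grep nfs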

root@saemrmb9> hotuser -p 681 |grep nfs
^C
oracle`kgnfs_getmsg                                         1   0.2%
oracle`kgnfs_complete_read                                  1   0.2%
oracle`kgnfswat                                             1   0.2%
oracle`kgnfs_getpmsg                                        1   0.2%
oracle`kgnfs_getaprocdata                                   1   0.2%
oracle`kgnfs_processmsg                                     1   0.2%
oracle`kgnfs_find_channel                                   1   0.2%
libnfsodm11.so`odm_io                                       1   0.2%
oracle`kgnfsfreemem                                         2   0.4%
oracle`kgnfs_flushmsg                                       2   0.4%
oracle`kgnfsallocmem                                        2   0.4%
oracle`skgnfs_recvmsg                                       3   0.5%
oracle`kgnfs_serializesendmsg                               3   0.5%

So, yes it seems Direct NFS is really being used by Oracle 11g.

performance geeks love V$ tables

There is a set of V$ tables that allows you to sample the performance of dNFS as seen by Oracle. I like V$ tables because I can write SQL scripts until I run out of Mt. Dew. The following views are available to monitor activity with dNFS.

  • v$dnfs_servers: Shows a table of servers accessed using Direct NFS.
  • v$dnfs_files: Shows a table of files now open with Direct NFS.
  • v$dnfs_channels: Shows a table of open network paths (or channels) to servers for which Direct NFS is providing files.
  • v$dnfs_stats: Shows a table of performance statistics for Direct NFS.

With some simple scripting, I was able to create a script to monitor the NFS IOPS by sampling the v$dnfs_stats view. The script simply samples the nfs_read and nfs_write operations, pauses for 5 seconds, then samples again to determine the rate.
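My script isn’t reproduced here, but a minimal sketch of the idea looks something like the following. It assumes a local "/ as sysdba" connection on the database server and uses the nfs_read and nfs_write counters from v$dnfs_stats mentioned above:

#!/bin/bash
# crude dNFS IOPS sampler - a sketch only, assumes "/ as sysdba" works locally
get_ops() {
  sqlplus -s "/ as sysdba" <<EOF | tr -d ' '
set pagesize 0 feedback off heading off
select to_char(sum(nfs_read + nfs_write)) from v\$dnfs_stats;
EOF
}

while true; do
  s1=$(get_ops)
  sleep 5
  s2=$(get_ops)
  printf "%s|%d\n" "$(date +%H:%M:%S)" $(( (s2 - s1) / 5 ))
done

Sampled every five seconds during the run, the output looks like this: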

timestmp|nfsiops
15:30:31|48162
15:30:36|48752
15:30:41|48313
15:30:46|48517.4
15:30:51|48478
15:30:56|48509
15:31:01|48123
15:31:06|48118.8

Excellent!  Oracle shows 48,000 NFS IOPS which agrees with the analytics from Fishworks.

what about the AWR?

Consulting the AWR shows "Physical reads" in agreement as well.

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               93.1            1,009.2       0.00       0.00
       DB CPU(s):               54.2              587.8       0.00       0.00
       Redo size:            4,340.3           47,036.8
   Logical reads:          385,809.7        4,181,152.4
   Block changes:                9.1               99.0
  Physical reads:           47,391.1          513,594.2
 Physical writes:                5.7               61.7
      User calls:           63,251.0          685,472.3
          Parses:                5.3               57.4
     Hard parses:                0.0                0.1
W/A MB processed:                0.1                1.1
          Logons:                0.1                0.7
        Executes:           45,637.8          494,593.0
       Rollbacks:                0.0                0.0
    Transactions:                0.1

so, why is iostat lying to me?

iostat(1M) monitors IO to devices and NFS mount points. But with Oracle Direct NFS, the kernel mount point is bypassed and each shadow process simply mounts and accesses the files directly from user space. To monitor dNFS traffic you have to use other methods, as described here. Hopefully this post was instructive on how to peel back the layers in order to gain visibility into dNFS performance with Oracle and Sun Open Storage.
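As a final OS-level sanity check, the dNFS traffic does show up as ordinary TCP connections from the shadow processes to the filer; the default NFS port 2049 used below is an assumption, so adjust it if your filer uses something else:

# dNFS appears as plain TCP connections to the filer's NFS port (2049 by default),
# not as activity against a kernel NFS mount point
netstat -an | grep 2049 | head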
