I have a strange problem with my Grid Control 11.10.1.2 Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads. The same information is also recorded in the logs.
NB: the really confusing part about Grid Control 11.1 is the use of WebLogic. You thought you knew where the Grid Control logs were? Forget what you knew about 10.2 and enter a different dimension :)
So to be able to react quicker to a hang of the OMS (or EMGC_OMS1 to be more precise) I set up nagios to periodically poll the login page.
I’m using a VM with OEL 5.5 64bit to deploy nagios to; the requirements are very moderate. The install process is well documented in the quickstart guide (I’m using the Fedora instructions as a basis). OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to build from the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of this writing. If you haven’t got a development environment, build it:
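A sketch of the packages I’d install first, following the quickstart’s Fedora instructions (openssl-devel is the one that bit me later, see below):

```shell
# Build prerequisites for nagios and the plugins on OEL 5.5, per the
# quickstart's Fedora instructions. openssl-devel is needed so that
# check_http can be built with SSL support.
PKGS="gcc glibc glibc-common gd gd-devel make openssl-devel httpd php wget"
echo "as root, run: yum install -y $PKGS"
```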
From then on it’s as simple as copy-and-pasting from the quickstart guide. The only problem I had was with the check_http plugin, caused by the missing openssl-devel package: I initially built the plugins without the “--with-openssl=/usr/include/openssl” flag. After executing the configure command again the build still didn’t work for check_http (undefined symbol foo), but that could be fixed with a “make clean; make”.
For the remainder of this article I assume you built nagios with these arguments to configure:
./configure --with-command-group=nagcmd
The plugins have been built with these options:
./configure --with-openssl=/usr/include/openssl --with-nagios-user=nagios --with-nagios-group=nagios
This will install nagios to /usr/local/nagios, which is fine by me; you’d obviously choose a different prefix when configuring for a production nagios server. Start the nagios server as per the quickstart guide using “service nagios start”.
With nagios up and running you can connect to the dashboard: http://nagios.example.com/nagios
You authenticate yourself using the nagiosadmin account and the password you supplied earlier to the htpasswd command.
Great! Your nagios environment is up and running. Next you need to add the OMS to the configuration. I opted to adapt the example configuration; three steps are to be performed:
Again, I should note that this setup is for monitoring 2 OMS hosts only, nothing else. I’m saying this because the way I add the targets is not the most elegant one. If you intend to add more targets to the nagios setup you should opt for the better approach that is commented out in the nagios.cfg file.
Modifying contact information
I would like to be informed in case something goes wrong. Nagios offers a wealth of notification methods; I’m limiting myself to email.
The file you’d like to modify with your favourite text editor is /usr/local/nagios/etc/objects/contacts.cfg
The most basic (but sufficient) way is to edit the nagiosadmin contact. Simply change the email address to your own and save the file. NB: you may have to configure your local MTA and add a mail relay; ask your friendly sysadmin how to do so.
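After the edit, the relevant part of contacts.cfg looks roughly like this (the address is obviously a placeholder):

```
define contact{
        contact_name    nagiosadmin             ; keep the short name
        use             generic-contact         ; inherit the template defaults
        alias           Nagios Admin
        email           oncall@example.com      ; <-- your address here
        }
```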
Create the check_oms command
Before we can define the OMS as a target in nagios, we need to tell nagios how to monitor it. Nagios comes with a basic set of plugins, amongst which check_http seems the most suitable. It needs to be compiled against openssl (hence the openssl-devel package, see above) since the OMS logon requires the https protocol.
Open /usr/local/nagios/etc/objects/commands.cfg with your favourite text editor and add a command such as this one:
define command{
        command_name    check_oms
        command_line    $USER1$/check_http -H $HOSTALIAS$ -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon
        }
Translated back to English this means: if the check_oms command is defined as a so-called service check in nagios, then the check_http script is called against the host defined by the host alias variable (we’ll define that in a minute). Furthermore, if we receive an HTTP 302 (moved temporarily) I want the check to return a critical error instead of an OK. If the response time is > 5 seconds I want the service to emit a “warning”, and if it takes longer than 10 seconds then that’s critical. The remaining options specify that I need to use SSL against port 7799 (the default Grid Control port; change if yours is different) and that the URL is /em/console/logon/logon. Don’t simply specify /em as the URL: that will silently redirect you to /em/console/logon/logon after an HTTP 302, which doesn’t help in this case. You can run the command interactively on the nagios host. The check is in /usr/local/nagios/libexec; the “-v” option displays the HTTP traffic:
[root@nagios libexec]# ./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v
GET /em/console/logon/logon HTTP/1.1
User-Agent: check_http/v1.4.15 (nagios-plugins 1.4.15)
Connection: close
Host: oms.example.com:7799
https://oms.example.com:7799/em/console/logon/logon is 8671 characters
STATUS: HTTP/1.1 200 OK
**** HEADER ****
Date: Mon, 28 Feb 2011 10:27:14 GMT
Server: Oracle-Application-Server-11g
Set-Cookie: JSESSIONID=tJ0yNr4CgGf4gyTPJR4kKTzL2WBg1SFLQvh0ytrpC3Kgv9xqkDsF!-2069537441; path=/em; HttpOnly
X-ORACLE-DMS-ECID: 00074S^kp_dF8Dd_Tdd9ic0000B4000DkI
X-Powered-By: Servlet/2.5 JSP/2.1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Content-Language: en
**** CONTENT ****
fe8
[content skipped...]
HTTP OK: HTTP/1.1 200 OK - 8671 bytes in 0.067 second response time |time=0.067383s;5.000000;10.000000;0.000000 size=8671B;;;0
[root@nagios libexec]#
Right: HTTP 200 and sub-second response time. I’m happy.
Create a new monitored target
This is the final step to be completed. I started off by copying the localhost.cfg file to oms.cfg and edited it. Below is a sample file with all comments and other verbose information removed:
define host{
        use                     linux-server
        host_name               OMS_pilot
        alias                   oms.example.com
        address                 192.168.99.13
        }

define service{
        use                     generic-service
        host_name               OMS_pilot
        service_description     HTTP
        is_volatile             0
        check_period            24x7
        max_check_attempts      10
        normal_check_interval   1
        retry_check_interval    1
        contact_groups          admins
        notification_options    w,u,c,r
        notification_interval   960
        notification_period     workhours
        check_command           check_oms
        }
I’m also using the check_ping command but that’s the default and not shown here. How does nagios know what server to execute the check against? That’s back in the command definition. Remember the -H $HOSTALIAS$ directive? Upon execution of the check, the value of the host’s alias configuration variable will be passed to the check_oms command. You should therefore ensure that the nagios host can resolve that host name, and I’d recommend using the FQDN as well.
The service check will execute the check_oms command against the host every minute, 24x7. In case the service is critical it will notify the contact group admins (which you edited in step 1), sending email during work hours (09:00-17:00 by default, defined in timeperiods.cfg).
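For reference, the stock workhours definition in timeperiods.cfg looks like this; widen it if your on-call window is longer:

```
define timeperiod{
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }
```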
The final bit where everything is tied together is the nagios.cfg file: add the definition for your host as in this example:
cfg_file=/usr/local/nagios/etc/objects/oms.cfg
Alternatively, if you would like to logically group your objects, you could create /usr/local/nagios/etc/servers and put all your server configuration files in there. Regardless what option you choose, the next step is to reload the nagios service to reflect the current configuration.
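Before reloading, it pays to let nagios verify the configuration; a sketch, with paths assuming the /usr/local/nagios prefix used above:

```shell
CFG=/usr/local/nagios/etc/nagios.cfg
# "nagios -v" parses the configuration and reports errors without
# starting or touching the running daemon
if /usr/local/nagios/bin/nagios -v "$CFG"; then
    service nagios reload
else
    echo "configuration errors -- not reloading" >&2
fi
```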
(Ignore the warning-that’s a http 403 issue on another host …)
Happy monitoring!
When I start a new project, I like to check performance from as many layers as possible. This helps verify that things are working as expected and helps me understand how the pieces fit together. During my recent work with dNFS and Oracle 11gR2, I started down the path of monitoring performance and was surprised to see that things are not always as they seem. This post will explore the various ways to monitor and verify performance when using dNFS with Oracle 11gR2 and Sun Open Storage “Fishworks“.
“iostat(1M)” is one of the most common tools to monitor IO. Normally, I can see activity on local devices as well as NFS mounts via iostat. But, with dNFS, my device seems idle during the middle of a performance run.
bash-3.0$ iostat -xcn 5
cpu
us sy wt id
8 5 0 87
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 6.2 0.0 45.2 0.0 0.0 0.0 0.4 0 0 c1t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 toromondo.west:/export/glennf
cpu
us sy wt id
7 5 0 89
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 57.9 0.0 435.8 0.0 0.0 0.0 0.5 0 3 c1t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 toromondo.west:/export/glennf
From the DB server perspective, I can’t see the IO. I wonder what the array looks like.
The analytics package available with Fishworks is the best way to verify performance with Sun Open Storage. It is easy to use, and indeed I was quickly able to verify activity on the array.
There are 48,987 NFSv3 operations/sec and ~403MB/sec going through the nge13 interface. So, this array is cooking pretty good. So, let’s take a peek at the network on the DB host.
nicstat is a wonderful tool developed by Brendan Gregg at Sun to show network performance. It shows you the critical data for monitoring network speeds and feeds: packet size, utilization, and rates for the various interfaces.
root@saemrmb9> nicstat 5
Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs %Util Sat
15:32:11 nxge0 0.11 1.51 1.60 9.00 68.25 171.7 0.00 0.00
15:32:11 nxge1 392926 13382.1 95214.4 95161.8 4225.8 144.0 33.3 0.00
So, from the DB server point of view, we are transferring about 390MB/sec… which correlates to what we saw with the analytics from Fishworks. Cool!
Ok, I wouldn’t be a good Sun employee if I didn’t use DTrace once in a while. I was curious to see the Oracle calls for dNFS so I broke out my favorite tool from the DTrace Toolkit. The “hotuser” tool shows which functions are being called the most. For my purposes, I found an active Oracle shadow process and searched for NFS related functions.
root@saemrmb9> hotuser -p 681 |grep nfs
^C
oracle`kgnfs_getmsg 1 0.2%
oracle`kgnfs_complete_read 1 0.2%
oracle`kgnfswat 1 0.2%
oracle`kgnfs_getpmsg 1 0.2%
oracle`kgnfs_getaprocdata 1 0.2%
oracle`kgnfs_processmsg 1 0.2%
oracle`kgnfs_find_channel 1 0.2%
libnfsodm11.so`odm_io 1 0.2%
oracle`kgnfsfreemem 2 0.4%
oracle`kgnfs_flushmsg 2 0.4%
oracle`kgnfsallocmem 2 0.4%
oracle`skgnfs_recvmsg 3 0.5%
oracle`kgnfs_serializesendmsg 3 0.5%
So, yes it seems Direct NFS is really being used by Oracle 11g.
There is a set of V$ views that allows you to sample the performance of dNFS as seen by Oracle. I like V$ views because I can write SQL scripts until I run out of Mt. Dew. The following views are available to monitor dNFS activity: v$dnfs_servers, v$dnfs_files, v$dnfs_channels, and v$dnfs_stats.
With some simple scripting I created a monitor for NFS IOPS by sampling the v$dnfs_stats view: sample the nfs_read and nfs_write counters, pause for 5 seconds, then sample again to determine the rate.
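A minimal sketch of such a sampler; the sqlplus invocation and "/ as sysdba" connection are placeholders for whatever credentials you use, and the interesting part is just the delta arithmetic:

```shell
#!/bin/sh
# Sample total dNFS read+write operations from v$dnfs_stats, wait,
# sample again, and print "HH:MM:SS|iops" like the output below.
INTERVAL=5

sample_ops() {
    # placeholder connection -- substitute your own credentials
    sqlplus -s "/ as sysdba" <<'EOF'
set heading off feedback off pagesize 0
select sum(nfs_read + nfs_write) from v$dnfs_stats;
EOF
}

iops() {
    # (second sample - first sample) / seconds elapsed
    echo $(( ($2 - $1) / $3 ))
}

if command -v sqlplus >/dev/null 2>&1; then
    s1=$(sample_ops)
    sleep "$INTERVAL"
    s2=$(sample_ops)
    echo "$(date +%H:%M:%S)|$(iops "$s1" "$s2" "$INTERVAL")"
fi
```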
timestmp|nfsiops
15:30:31|48162
15:30:36|48752
15:30:41|48313
15:30:46|48517.4
15:30:51|48478
15:30:56|48509
15:31:01|48123
15:31:06|48118.8
Excellent! Oracle shows 48,000 NFS IOPS which agrees with the analytics from Fishworks.
Consulting the AWR shows “Physical reads” in agreement as well.
Load Profile Per Second Per Transaction Per Exec Per Call
~~~~~~~~~~~~ --------------- --------------- ---------- ----------
DB Time(s): 93.1 1,009.2 0.00 0.00
DB CPU(s): 54.2 587.8 0.00 0.00
Redo size: 4,340.3 47,036.8
Logical reads: 385,809.7 4,181,152.4
Block changes: 9.1 99.0
Physical reads: 47,391.1 513,594.2
Physical writes: 5.7 61.7
User calls: 63,251.0 685,472.3
Parses: 5.3 57.4
Hard parses: 0.0 0.1
W/A MB processed: 0.1 1.1
Logons: 0.1 0.7
Executes: 45,637.8 494,593.0
Rollbacks: 0.0 0.0
Transactions: 0.1
iostat(1M) monitors IO to devices and nfs mount points. But with Oracle Direct NFS, the mount point is bypassed and each shadow process talks NFS directly. To monitor dNFS traffic you have to use other methods, as described here. Hopefully this post was instructive on how to peel back the layers in order to gain visibility into dNFS performance with Oracle and Sun Open Storage.