Grid Control

Not So Smart

It’s been a few months since I did an install of Oracle Enterprise Manager 11g. I am, however, talking about some experiences from a real world implementation that I performed a while back at the UKOUG’s Management and Infrastructure SIG on the 27th (you can book for this event here). I thought therefore that [...]

Cluster callouts to create blackouts in EM

Finally I got around to providing a useful example for a cluster callout script. It is actually on the verge of taking too long: remember that scripts in the $GRID_HOME/racg/usrco/ directory should execute quickly. Before deploying this, you should definitely ensure that the script executes quickly enough; the “time” utility can help you with this. Nevertheless this has been necessary to work around a limitation of Grid Control: RAC One Node databases are not supported in GC 11.1 (I complained about that earlier).

The Problem

To work around the problem I wrote a script which can alleviate one of the arising problems: when using srvctl relocate database, another instance (usually called dbName_2) will be started to allow existing sessions to survive the failover operation if they use TAF or FAN/FCF.

This poses a big problem to Grid Control though: the second instance didn’t exist when you registered the database as a target, hence GC doesn’t know about it. Subsequently you may get paged that the database is down when in reality it is not. Receiving one of these “false positive” alarms at 02:00 in the morning is annoying at best. Actually, Grid Control is right in assuming that the database is down: although detected as a cluster database target, it only consists of 1 instance. If that’s down, it has to be assumed that the whole cluster database is down. In a perfect world we wouldn’t have this problem: GC would be aware that the RON database had moved to another node in the cluster and would update its configuration accordingly. This is planned for the next major release sometime later in 2011. Apparently dbconsole has the ability to deal with such a situation.

Now with the background explained, management had to weigh the possibilities: either not register the RAC One database in Grid Control and have no monitoring at all, or bite the bullet and have monitoring only when the initial instance is started on the primary node. The decision was made to have (limited) monitoring. To prevent the DBA from being woken up I developed the simple script below to automatically create a blackout in GC if the “_2” instance starts. Subsequently, the blackout is taken off when the “_1” instance starts.

Room for improvement: the script assumes that a RON database can only have a maximum of 2 member servers. If your database can run on more than 2 nodes then you should use emcli relocate_target if the _1 instance comes up on a different node from what GC expects.

The Script

My algorithm checks for cluster events, and if an instance dbName_2 starts, I create a blackout on the initial instance to prevent being paged until Oracle have come up with a better solution (we are flying blind once the 2nd database instance has started).

The script assumes that you have deployed emcli on each cluster node (or on ACFS). EMCLI is the Enterprise Manager Command Line Interface; it’s located on your OMS together with the installation instructions. This is the default location: https://oms.example.com:7799/em/console/emcli/download (7799 is the default port for Grid Control).

Let’s have a look at the script:

#!/bin/bash

# enable debugging if needed
set -x
exec >> /tmp/autoBlackout.log 2>&1

EVENTTYPE=$1

# only SERVICEMEMBER populates instance, database and service as needed
# for the blackout section below.
if [ "$EVENTTYPE" != "SERVICEMEMBER" ]; then
 exit 0
fi

# adjust to your needs or set to the empty string if you are not using db_domain
# assumes that both database and service have the same domain
DOMAIN=example.com

# bail out if there are too many instances of this script running. Inform the
# admin via email
ME=`basename $0`
RUNNING=`ps -ef | grep -v grep | grep $ME | wc -l`
if [ $RUNNING -ge 6 ]; then
 echo Too many instances of this script running, aborting
 echo Too many instances of $ME running, aborting |
 mail -s "$RUNNING instances of $ME detected on `hostname`" admin@example.com
 # actually bail out, as announced in the comment above
 exit 1
fi

# set up for emcli (emcli requires jdk 1.6)
JAVA_HOME=/shared/acfs/emcli/jdk1.6.0_24
PATH=$JAVA_HOME/bin:$PATH
EMCLI=/shared/acfs/emcli/emcli
export JAVA_HOME PATH

# turn off debugging for a moment - the below parsing of the command line
# parameters is very verbose.
set +x

# read the parameters passed to us (modified version of a script
# found at rachelp.nl)
for ARGS in $*;
 do
 PROPERTY=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $1}'`
 VALUE=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $2}'`
 case $PROPERTY in
 VERSION|version)    VERSION=$VALUE ;;
 SERVICE|service)    SERVICE=$VALUE ;;
 DATABASE|database)    DATABASE=$VALUE ;;
 INSTANCE|instance)    INSTANCE=$VALUE ;;
 HOST|host)        HOST=$VALUE ;;
 STATUS|status)        STATUS=$VALUE ;;
 REASON|reason)        REASON=$VALUE ;;
 CARD|card)        CARDINALITY=$VALUE ;;
 TIMESTAMP|timestamp)    LOGDATE=$VALUE ;;
 ??:??:??)        LOGTIME=$VALUE ;;
 esac
 done

# and turn debugging on again
set -x

# targets are reported in lower case :( Someone please suggest a better
# way to get a lower case string to upper case
DATABASE=`echo $DATABASE | tr "[a-z]" "[A-Z]"`

# targets affected are rac_database and the oracle_database (instance)
# not using emcli here as it has to be quick. A rac_database target is a
# composite target, consisting of multiple oracle_database targets. In
# RAC One Node there is only 1 instance - see output from GC below:
# $ emcli get_targets | grep "RAC"
# 0       Down           oracle_database  RAC.example.com_RAC_1
# 0       Down           rac_database     RAC.example.com

# define what we want to black out (only ever the primary instance!)
BLACKOUT_NAME=blackout_${DATABASE}
BLACKOUT_TARGETS="$DATABASE.${DOMAIN}:rac_database;${DATABASE}.${DOMAIN}_${DATABASE}_1:oracle_database"

# create a blackout if the secondary instance is up (we only ever register the _1 instance)
# the blackout duration is indefinite: it will be stopped and lifted automatically. You may
# want to limit this to a few hours to raise visibility.
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_2" ]]; then
 echo create blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI create_blackout -name=${BLACKOUT_NAME} -add_targets=${BLACKOUT_TARGETS} \
 -reason="auto blackout" -schedule="frequency:once;duration:-1"
fi

# disable the blackout if instance *_1 starts
# this is where the script could be improved if the RON database can run on more
# than 2 nodes. You could use emcli relocate_target to relocate the target to another
# node
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_1" ]]; then
 echo remove blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI stop_blackout -name=${BLACKOUT_NAME}
 $EMCLI delete_blackout -name=${BLACKOUT_NAME}
fi

I tried to add a lot of comments to the script, which should make it easy for you to adjust it. I recommend you store it in ACFS and mount that directory on all cluster nodes. Create a symbolic link from the ACFS to $GRID_HOME/racg/usrco/ to make maintenance easier. You could enable log rotation for the logfile in /tmp if you liked, otherwise keep an eye on it so it doesn’t grow to gigabytes.
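Since callout scripts have to be quick, one thing worth trying is to parse the arguments with bash built-ins instead of forking awk twice per argument. The following is only a sketch under assumptions: the function names parse_event and blackout_action are made up for illustration, and the sample event at the bottom is invented, so check it against what your clusterware really passes to the callout.

```shell
#!/bin/bash
# Sketch: parse KEY=value callout arguments with bash built-ins only.
# parse_event and blackout_action are hypothetical helper names.

parse_event() {
    # the first argument is the event type, the rest are KEY=value pairs
    EVENTTYPE=$1
    shift
    local arg key value
    for arg in "$@"; do
        case $arg in
            *=*) key=${arg%%=*}; value=${arg#*=} ;;
            *)   continue ;;   # e.g. the time part of the timestamp
        esac
        case $key in
            database|DATABASE) DATABASE=$value ;;
            instance|INSTANCE) INSTANCE=$value ;;
            status|STATUS)     STATUS=$value ;;
            host|HOST)         HOST=$value ;;
        esac
    done
}

# Decide what to do with the blackout: prints "create", "remove" or "none".
blackout_action() {
    local status=$1 instance=$2
    if [ "$status" != "up" ]; then
        echo none
        return
    fi
    case $instance in
        *_2) echo create ;;
        *_1) echo remove ;;
        *)   echo none ;;
    esac
}

# made-up sample event for a database called "ron"
parse_event SERVICEMEMBER VERSION=1.0 service=ron.example.com \
    database=ron instance=ron_2 host=node1 status=up reason=user \
    card=1 timestamp=2011-03-01 10:15:00
echo "$(blackout_action "$STATUS" "$INSTANCE") blackout for $DATABASE"
```

Regarding the upper case conversion: on bash 4 or later, ${DATABASE^^} does the job without calling tr at all. Older shells (bash 3.x) don’t know that syntax, so tr has to stay there.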

Automatic log gathering for Grid Control 11.1

Still debugging the OMS problem (it occasionally hangs and has to be restarted) I wrote a small shell script to help me gather all required logs for Oracle support. These are the logs I need for the SR; Niall Litchfield has written a recent blog post about other useful log locations.

The script is basic, and can possibly be extended. However it saved me a lot of time getting all the required information to one place from where I could take it and attach it to the service request. Before uploading I usually zip all files into dd-mm-yyyy-logs.nnn.zip to avoid clashing with logs already uploaded. I run the script via cron daily at 09:30.
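The naming scheme and the cron schedule mentioned above could look something like the following sketch. It assumes that “nnn” is a three digit counter starting at 001; the function name next_archive_name and the script path in the crontab comment are made up for illustration.

```shell
#!/bin/bash
# Sketch: pick the next free dd-mm-yyyy-logs.nnn.zip name in a directory.
# next_archive_name is a hypothetical helper.

next_archive_name() {
    local dir=$1
    local day n name
    day=$(date +%d-%m-%Y)
    n=1
    while :; do
        name=$(printf '%s-logs.%03d.zip' "$day" "$n")
        # stop at the first name that is not taken yet
        [ -e "$dir/$name" ] || break
        n=$((n + 1))
    done
    echo "$name"
}

# crontab entry for a daily run at 09:30 (the script path is hypothetical):
# 30 9 * * * /home/oracle/bin/gather_gc_logs.sh

# usage, assuming the tar files are in /tmp/sr and zip is installed:
# (cd /tmp/sr && zip "/tmp/$(next_archive_name /tmp)" *.tar.gz)
```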

#!/bin/bash

# A script to gather diagnostic information for OMS GC 11.1 for use by Oracle
# Support
#
# WARNING the previous output of the script's execution is NOT preserved!
#
# (c) Martin Bach Consulting Ltd All rights reserved
# http://martinbach-consulting.com
#
# The script performs the following tasks
# 1) creates a thread dump (kill -3) for the EMGC_OMS1 server
# 2) creates a compressed tar archive of all server logs
#    including the webtier

set -x

DST_DIR=/tmp/sr
GC_INST=/u01/app/oracle/product/gc_inst

# get a thread dump for EMGC_OMS1 (kill -3 sends SIGQUIT, which makes the JVM
# print a thread dump rather than a heap dump). I upped the JVM args to -Xms of
# 768M which works fine as an identifier in Solaris 10. For Linux you may have
# to grep for EMGC_OMS1 in the ps output
# the dump goes into EMGC_OMS1.out which is compressed and saved later
PIDS=`ps -ef | grep "Xms768m" | grep -v grep | awk '{print $2}'`
# only signal if a matching process was found
if [ -n "$PIDS" ]; then
kill -3 $PIDS
fi

if [ ! -d $DST_DIR ]; then
echo $DST_DIR does not exist-creating it
mkdir -p $DST_DIR
fi

# get the relevant logs
cd $GC_INST/em/EMGC_OMS1/sysman/log/
tar -cvf - . | gzip > $DST_DIR/gc_inst.em.EMGC_OMS1.sysman.log.tar.gz

cd $GC_INST/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs/
tar -cvf - . | gzip > $DST_DIR/gc_inst.user_projects.domains.gcdomain.servers.emgc_oms1.log.tar.gz

cd $GC_INST/user_projects/domains/GCDomain/servers/EMGC_ADMINSERVER/logs/
tar -cvf - . | gzip > $DST_DIR/gc_inst.user_projects.domains.gcdomain.servers.adminserver.log.tar.gz

cd $GC_INST/WebTierIH1/diagnostics/logs/OHS/ohs1
tar -cvf - . | gzip > $DST_DIR/webtier.ohs1.log.tar.gz

cd $GC_INST/WebTierIH1/diagnostics/logs/OPMN/opmn
tar -cvf - . | gzip > $DST_DIR/webtier.opmn.log.tar.gz
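If the list of directories grows, the repeated cd/tar pairs could be folded into a small helper. This is only a sketch: archive_dir is a made-up name, and it keeps the tar-piped-to-gzip approach of the script rather than relying on a tar that can compress on its own.

```shell
#!/bin/bash
# Sketch: one helper instead of five cd/tar pairs.
# archive_dir is a hypothetical helper name.

archive_dir() {
    local src=$1 dst=$2
    if [ ! -d "$src" ]; then
        echo "skipping $src: not a directory" >&2
        return 1
    fi
    # the subshell keeps the caller's working directory unchanged
    (cd "$src" && tar -cf - .) | gzip > "$dst"
}

# usage with the variables from the script:
# archive_dir $GC_INST/em/EMGC_OMS1/sysman/log $DST_DIR/gc_inst.em.EMGC_OMS1.sysman.log.tar.gz
```

A missing directory then only produces a message on stderr and a non-zero return code instead of a tar archive of whatever the current directory happens to be.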

Good luck!

Quis custodiet ipsos custodes: Nagios monitoring for Grid Control

I have a strange problem with my Grid Control 11.1.0.1.2 Management Server in a Solaris 10 zone. When restarted, the OMS will serve requests fine for about 2 to 4 hours and then “hang”. Checking the Admin Server console I can see that there are stuck threads. The same information is also recorded in the logs.

NB: the really confusing part about Grid Control 11.1 is the use of Weblogic. You thought you knew where the Grid Control logs were? Forget about what you knew about 10.2 and enter a different dimension :)

So to be able to react quicker to a hang of the OMS (or EMGC_OMS1 to be more precise) I set up nagios to periodically poll the login page.

I’m using a VM with OEL 5.5 64bit to deploy nagios to; the requirements are very moderate. The install process is well documented in the quickstart guide, where I’m using the Fedora instructions as a basis. OEL 5.5 doesn’t have nagios 3 RPMs available, so I decided to use the source downloaded from nagios.org. The tarballs you need are nagios-3.2.3.tar.gz and nagios-plugins-1.4.15.tar.gz at the time of this writing. If you haven’t got a development environment, build it:

  • # yum install httpd
  • # yum install php
  • # yum install gcc glibc glibc-common
  • # yum install gd gd-devel
  • # yum install openssl-devel

From then on it’s as simple as copy-pasting from the quickstart guide. The only problem I had with the check_http plugin was the lack of openssl-devel. I initially built the plugins without the “--with-openssl=/usr/include/openssl” flag. After executing the configure command again the build didn’t work for check_http (undefined symbol foo), but that could be fixed with a “make clean; make”. A word of warning: my wordpress theme tends to combine two dashes into one, and there is nothing I can do about that, sorry. If you see a single long dash in front of an option name (frustrating in the case of the configure command etc), read it as two hyphens.

For the remainder of this article I assume you built nagios with these arguments to configure:

./configure --with-command-group=nagcmd

The plugins have been built with these options:

./configure --with-openssl=/usr/include/openssl --with-nagios-user=nagios --with-nagios-group=nagios

This will install nagios to /usr/local/nagios which is fine by me. You’d obviously choose a different prefix when configuring for a production nagios server. Start the nagios server as per the quickstart guide using “service nagios start”.

With nagios up and running you can connect to the dashboard: http://nagios.example.com/nagios

You authenticate yourself using the nagiosadmin account and the password you supplied earlier to the htpasswd command.

Great! Your nagios environment is up and running. Next you need to add the OMS to the configuration. First of all I opted to change the example configuration. Three steps are to be performed:

  • Modify the contacts
  • Create a check_oms command
  • Add the OMS to the list of monitored targets

Again, I should note that this setup is for monitoring 2 OMS hosts only, nothing else. I’m saying this because the way I add the targets is not the most elegant one. If you intend to add more targets to the nagios setup you should opt for a better approach which is commented out in the nagios.cfg file.

Modifying contact information

I would like to be informed in case something goes wrong. Nagios offers a wealth of notification methods; I’m limiting myself to email.

The file you’d like to modify with your favourite text editor is /usr/local/nagios/etc/objects/contacts.cfg

The most basic (but sufficient) way is to edit the nagiosadmin contact. Simply change the email address to your email address and save the file. NB: you may have to configure your local MTA and add a mail relay. Ask your friendly sys admin how to do so.

Create the check_oms command

Before we can define it as a target in nagios, we need to tell nagios how to monitor the OMS. Nagios comes with a basic set of plugins, amongst which the check_http seems the most suitable. It needs to be compiled with the openssl-devel package (see above) since the OMS logon requires the https protocol.

Open /usr/local/nagios/etc/objects/commands.cfg with your favourite text editor and add a command such as this one:

define command{
command_name    check_oms
command_line    $USER1$/check_http -H $HOSTALIAS$ -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon
}

Translated back to English this means that if the check_oms command is defined as a so-called service check in nagios, then the check_http script is called against the host defined by the host alias variable (we’ll define that in a minute). Furthermore, if we receive HTTP 302 codes (moved temporarily) I want the check to return a critical error instead of an OK. If my response time is > 5 seconds I want the service to emit a “warning” reply, and if it takes longer than 10 seconds then that’s critical. The remaining options specify that I need to use SSL against port 7799 (the default Grid Control port, change if yours is different) and the URL is /em/console/logon/logon. Don’t simply specify /em as the URL as that will silently redirect you to /em/console/logon/logon after an HTTP 302 message, which doesn’t help in this case. You can run the command interactively on the nagios host. The check is in /usr/local/nagios/libexec; the “-v” option displays the HTTP traffic:

[root@nagios libexec]# ./check_http -H oms.example.com -f critical -w 5 -c 10 --ssl -p 7799 --url /em/console/logon/logon -v
GET /em/console/logon/logon HTTP/1.1
User-Agent: check_http/v1.4.15 (nagios-plugins 1.4.15)
Connection: close
Host: oms.example.com:7799
https://oms.example.com:7799/em/console/logon/logon is 8671 characters
STATUS: HTTP/1.1 200 OK
**** HEADER ****
Date: Mon, 28 Feb 2011 10:27:14 GMT
Server: Oracle-Application-Server-11g
Set-Cookie: JSESSIONID=tJ0yNr4CgGf4gyTPJR4kKTzL2WBg1SFLQvh0ytrpC3Kgv9xqkDsF!-2069537441; path=/em; HttpOnly
X-ORACLE-DMS-ECID: 00074S^kp_dF8Dd_Tdd9ic0000B4000DkI
X-Powered-By: Servlet/2.5 JSP/2.1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Content-Language: en
**** CONTENT ****
fe8
[content skipped...]
HTTP OK: HTTP/1.1 200 OK - 8671 bytes in 0.067 second response time |time=0.067383s;5.000000;10.000000;0.000000 size=8671B;;;0
[root@nagios libexec]#

Right: HTTP 200 and sub-second response time. I’m happy.
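To make the effect of the -w and -c thresholds a bit more tangible, here is a simplified model of how a response time maps to the usual Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). It is a sketch for illustration only and not taken from the check_http source; the function name classify_response_time is made up.

```shell
#!/bin/bash
# Sketch: how -w/-c thresholds map to Nagios plugin exit codes.
# Simplified model for illustration, not the check_http implementation.

classify_response_time() {
    local seconds=$1 warn=$2 crit=$3
    # awk handles the floating point comparison and sets the exit code
    awk -v t="$seconds" -v w="$warn" -v c="$crit" 'BEGIN {
        if (t > c)      { print "CRITICAL"; exit 2 }
        else if (t > w) { print "WARNING";  exit 1 }
        else            { print "OK";       exit 0 }
    }'
}

# the response time measured above, with the thresholds from check_oms
classify_response_time 0.067 5 10
```

Nagios itself only looks at the exit code and the first line of output of a plugin, which is why a hanging OMS eventually shows up as a critical service.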

Create a new monitored target

This is the final step to be completed. I started off by copying the localhost.cfg file to oms.cfg and edited it. Below is a sample file with all comments and other verbose information removed:

define host{
use               linux-server
host_name         OMS_pilot
alias             oms.example.com
address           192.168.99.13
}

define service{
use                     generic-service
host_name               OMS_pilot
service_description     HTTP
is_volatile             0
check_period            24x7
max_check_attempts      10
normal_check_interval   1
retry_check_interval    1
contact_groups          admins
notification_options    w,u,c,r
notification_interval   960
notification_period     workhours
check_command           check_oms
}

I’m also using the check_ping command but that’s the default and not shown here. How does nagios know what server to execute the check against? That’s back in the command definition. Remember the -H $HOSTALIAS$ directive? Upon execution of the check, the value of the host’s alias configuration variable will be passed to the check_oms command. You should therefore ensure that the nagios host can resolve that host name, and I’d recommend using the FQDN as well.

The service check will execute the check_oms command against the host every minute 24×7. In case the service is critical, it will notify the contact group admins (which you edited in step 1) and send email during work hours (09:00 to 17:00 by default, defined in timeperiods.cfg).

The final bit where everything is tied together is the nagios.cfg file: add the definition for your host as in this example:

cfg_file=/usr/local/nagios/etc/objects/oms.cfg

Alternatively, if you would like to logically group your objects, you could create /usr/local/nagios/etc/servers and put all your server configuration files in there. Regardless of which option you choose, the next step is to reload the nagios service to reflect the current configuration.
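Before reloading it is worth letting nagios verify the configuration with its -v option, as the quickstart guide suggests. A minimal sketch that only reloads when the check passes could look like this; the function name safe_reload is made up, and the paths assume the default /usr/local/nagios prefix used in this article.

```shell
#!/bin/bash
# Sketch: only reload when the configuration check passes.
# safe_reload is a hypothetical helper; arguments are the nagios binary,
# the main config file, and then the reload command to run.

safe_reload() {
    local nagios_bin=$1 cfg=$2
    shift 2
    if "$nagios_bin" -v "$cfg"; then
        "$@"        # the reload command, e.g. service nagios reload
    else
        echo "configuration check failed, not reloading" >&2
        return 1
    fi
}

# usage on the nagios host (as root):
# safe_reload /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg service nagios reload
```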

(Ignore the warning: that’s an HTTP 403 issue on another host …)

Happy monitoring!

GC 11.1 and Monitoring Templates

Throughout the last 2 weeks I have been working (or better: tried to work) with Grid Control 11.1 as the central monitoring and deployment solution for my current project.

The plan is to use EMGC 11.1 in conjunction with an 8 node cluster to automatically deploy RAC One Node databases. Please don’t ask about RAC One Node: that wasn’t my decision, and as I understand it the previous project members only chose this as a poor compromise to keep the operations team happy(-ish).

Besides the fact that the OMS, which runs in a Solaris Zone, repeatedly “hangs” and can’t be contacted by emcli or any browser (Bug 11804553), RAC One Node is NOT SUPPORTED as a target in Grid Control 11.1. It might be supported in GC 12.1 later in 2011. But I digress.

The Requirement

The OPS team maintains their own 10.2.0.5 management servers. To allow us to perform some testing with the automatic database deployment without messing with a live OMS, it has been decided to install OEM GC 11.1 with PSU 2 locally on Solaris with a repository database on Linux. We needed GC 11.1 to support our 11.2.0.2 cluster.

After the installation of the OMS I tried to export the required management templates from the live OMS (remember it’s 10.2.0.5) and import them into 11.1 to save myself a lot of work.

Export a management template

The export function seems to have been introduced in 10.2.0.3 and it works great. All you need to do is hop on the OMS, and use “emcli” (Enterprise Manager Command Line Interface) to log on and export the template. A sample session is shown here:

  • emcli login -username=yourUserName -password=yourPassword
  • emcli export_template -name=TemplateName -target_type=TargetType -output_file=/path/to/templateName.xml

If you are unsure about template names and targets, you can connect to the repository as sysman and query mgmt_templates:

SQL> SELECT TEMPLATE_NAME,TARGET_TYPE FROM MGMT_TEMPLATES;

And so I happily exported the management templates from the 10.2.0.5 OMS.
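With more than a handful of templates the export can be scripted. The sketch below reads “template name|target type” pairs, for example spooled from the query above, and runs the export_template command shown earlier for each of them. The function name export_templates, the DRY_RUN switch and the sample template names are made up; it also assumes you have already run emcli login.

```shell
#!/bin/bash
# Sketch: export several templates in one go.
# Input: one "template name|target type" pair per line on stdin.
# Set DRY_RUN=1 to print the emcli commands instead of running them.
# export_templates is a hypothetical helper.

export_templates() {
    local outdir=$1 name type file
    while IFS='|' read -r name type; do
        [ -n "$name" ] || continue
        # spaces in template names become underscores in the file name
        file="$outdir/${name// /_}.xml"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "emcli export_template -name=\"$name\" -target_type=\"$type\" -output_file=\"$file\""
        else
            # keep emcli away from the list we are reading on stdin
            emcli export_template -name="$name" -target_type="$type" \
                -output_file="$file" < /dev/null
        fi
    done
}

# dry run with two made-up template names
printf '%s\n' 'Prod DB Template|oracle_database' 'Prod Host Template|host' |
    DRY_RUN=1 export_templates /tmp/templates
```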

The Bad News

Unfortunately, you can’t import non 11.1 templates into an 11.1 OMS. When I tried it I got the following error:

$ emcli import_template -files="emd.10205.xml"
Monitoring template file emd.10205.xml exported from 10.2.0.5.0 OMS can not be imported to 11.1.0.1.0 OMS

Bugger. Sure enough, the XML file has a version tag:

<?xml version = '1.0' encoding = 'UTF-8'?>

...

The solution is to revert to the bad old times and manually compare source and destination. A rather laborious and tiresome way of getting information across. Don’t forget to export the completed template from 11.1 to save yourself from going through that again.

Error message of the day: OUI-25023 and the FQDN

It’s been a long day with many problems around a Grid Control installation, including (but not limited to) corruption of the repository database, bugs in OUI when it comes to deinstalling the Oracle Management Server, lots of files left over by the weblogic “uninstall.sh” script and many more. Some of the error messages were quite misleading, and OUI-25023 just was one too many. What happened?

Earlier today I was trying to install the 64bit 11.1.0.1 agent on an 8 node cluster. After an initial headache (see below) it worked ok. However, I couldn’t resist mentioning OUI-25023. Here’s the complete story.

I downloaded the 11.1 agent for linux x86-64 as per the GC 11.1 documentation and deployed it to my freshly installed management server. The OMS is on Solaris SPARC, and Grid Control doesn’t supply agents for a different platform. However, the security experts have locked the oracle account down on the cluster which ruled out the “agent push” scenario. I then opted for the installation via a response file, as described in the documentation.

The idea is that you set a number of variables in a response file “additional_agent.rsp” to inform the installer about your desired configuration. If memory serves me right then there was no way to run OUI in GUI mode, it had to be a silent installation.

Amongst the variables you need to specify the cluster nodes. I dutifully filled that information in, using {"node1.example.com", "node2.example.com", …, "node8.example.com"}, i.e. the fully qualified domain names as per the documentation.

Starting the installer with the -silent and -responseFile /path/to/additional_agent.rsp options failed:

$ ./linux_x64/agent/runInstaller -silent -responseFile /u01/app/oracle/stage/agent_11.1.0.1/linux....
Starting Oracle Universal Installer...

Checking Temp space: must be greater than 150 MB.   Actual 1662 MB    Passed
Checking swap space: must be greater than 150 MB.   Actual 16415 MB    Passed
Preparing to launch Oracle Universal Installer from /tmp/OraInstall2011-02-15_03-15-10PM. Please wait ...
*** Check for updates ***
*** Select Installation Type ***
*** Check Prerequisites ***
*** Specify Oracle Management Service Location ***
*** Customize Ports ***
*** Review ***

ERROR: OUI-25023: The local node is not selected for installing this product. Include the local node in the cluster list or perform the installation on the nodes on which the install is to be performed.Also note that the cluster nodes should be specified in non FQDN format.

$

I was pretty sure that my node_list included my first node… Pinging the node and DNS access also worked (I wrote about that in a previous post). So a little research on Metalink revealed this gem:

OUI-25023 When Trying To Install A Patchset On RAC [ID 394868.1]

So, contrary to the documentation but in line with the last sentence of the error message (… non FQDN format …), I needed to specify the node_list as in the inventory: without the “example.com” bit. Once that was changed in the response file, the installation completed ok.
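With 8 nodes, stripping the domain by hand is error prone, so the short-name list can be generated. This is a small sketch; short_node_list is a made-up name, and the output simply mimics the brace list shown earlier, so adjust it to whatever your response file expects.

```shell
#!/bin/bash
# Sketch: turn fully qualified host names into a short-name node list.
# short_node_list is a hypothetical helper.

short_node_list() {
    local node list=""
    for node in "$@"; do
        # ${node%%.*} removes everything from the first dot onwards
        list="${list:+$list,}\"${node%%.*}\""
    done
    echo "{$list}"
}

short_node_list node1.example.com node2.example.com node8.example.com
```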

Running Grid Control Agent commands standalone

I had an error message today from one of my grid agents which was cut short in the GUI just when it became interesting. So I thought of a way of running the command on the command line to get the full output.

This has been a little easier than I thought. I based my approach on an earlier blog article on my knowledgebase to get the perl environment variables set. I then needed to figure out where some of the libraries (perl modules ending in .pm) that the agent scripts refer to were located.

A simple “locate -i *pm | grep $ORACLE_HOME” did it. This enabled me to write a preliminary script to run an EM agent task, shown below. It expects that you have run “oraenv” previously to set the environment to the AGENT_HOME. When referring to ORACLE_HOME in the following, the AGENT_HOME is meant. It takes the full path to the script to be executed as its only parameter and checks that both ORACLE_HOME and $1 are set.

#!/bin/bash

if [ "${ORACLE_HOME}" == "" ]; then
echo "Need to specify the ORACLE_HOME first!"
exit 1
fi

if [ "$1" == "" ]; then
echo "Need to specify the full path to the script to run"
exit 1;
fi

LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

PERLBIN=$ORACLE_HOME/perl/bin/perl
PERLV=`${PERLBIN} -e 'printf "%vd", $^V'`

PERLINC="-I ${ORACLE_HOME}/perl/lib/${PERLV}"
PERLINC="${PERLINC} -I ${ORACLE_HOME}/perl/lib/site_perl/${PERLV}"
PERLINC="${PERLINC} -I ${ORACLE_HOME}/lib"
PERLINC="${PERLINC} -I $ORACLE_HOME/sysman/admin/scripts/"

RUNTHIS="${PERLBIN} ${PERLINC} $1"

exec ${RUNTHIS}

This allowed me to run the script crs_status.pl in $ORACLE_HOME/sysman/admin/scripts/ without problems.

11.1 GC agent refuses to start

This is a follow-up post from my previous tale of how not to move networks for RAC. After having successfully restarted the cluster as described in a previous post I went on to install a Grid Control 11.1 system. This was to be on Solaris 10 SPARC. Why SPARC? Not my platform of choice when it comes to Oracle software, but my customer has a huge SPARC estate and wants to make the most of it.

After the OMS has been built (hopefully I’ll find time to document this as it can be quite tricky on SPARC) I wanted to secure the agents on my cluster against it. That worked ok for the first node:

  • emctl clearstate agent
  • emctl secure agent
  • emctl start agent

Five minutes later the agent appeared in my list of agents in the Grid Control console. With this success backing me I went to do the same on the next cluster node.

Here things were different. Here’s the sequence of commands I used:

$ emctl stop agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved
$

I didn’t pay too much attention to the fact that there has been no acknowledgement of the completion of the stop command. I noticed something wasn’t quite right when I tried to get the agent’s status:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
emctl stop agent
Error connecting to https://node2.example.com:3872/emd/main

Now that should have reported that the agent was down. Strange. I tried a few more commands,  such as the following one to start the agent.

[agent]oracle@node2.example.com $ emctl start agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already running

Which wasn’t the case: there was no agent process whatsoever in the process table. I also checked the emd.properties file. Note that the emd.properties file is in $AGENT_HOME/hostname/sysman/config/ now instead of $AGENT_HOME/sysman/config as it was in 10g.

Everything looked correct, and even a comparison with the first node didn’t reveal any discrepancy. So I scratched my head a little more until I found a MOS note on the subject stating that the agent cannot listen to multiple addresses. The note is for 10g only and has the rather clumsy title “Grid Control Agent Startup: “emctl start agent” Command Returns “Agent is already running” Although the Agent is Stopped” (Doc ID 1079424.1).

Although stating it’s for 10g and multiple NICs it got me thinking. And indeed, the /etc/hosts file had not been updated, leaving the old cluster address in /etc/hosts while the new one was in DNS.

# grep node2 /etc/hosts
10.x.x4.42            node2.example.com node2
172.x.x.1x8          node2-priv.example.com node2-priv
# host node2.example.com
node2.example.com has address 10.x5.x8.3
[root@node2 ~]# grep ^hosts /etc/nsswitch.conf
hosts:      files dns

This also explained why the agent started on the first node: it had an updated /etc/hosts file. Why the other nodes didn’t have their hosts file updated will forever remain a mystery.
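A mismatch like this is easy to spot once you compare what the hosts file says with what DNS says. The sketch below only covers the hosts file side and leaves the DNS side to the “host” command used above; hosts_file_addr is a made-up name, and the simple parser ignores corner cases such as trailing comments on a line.

```shell
#!/bin/bash
# Sketch: look up a name in a hosts-format file so that the result can be
# compared with DNS. hosts_file_addr is a hypothetical helper.

hosts_file_addr() {
    local name=$1 file=${2:-/etc/hosts}
    # skip comment lines, then match the name in any column after the address
    awk -v n="$name" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$file"
}

# usage on a cluster node; "host" asks DNS directly and ignores /etc/hosts:
# hosts_file_addr node2.example.com
# host node2.example.com
```

With “hosts: files dns” in nsswitch.conf the hosts file wins, so any difference between the two answers means the agent sees the stale address.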

Things then changed dramatically after the hosts file had been updated:

$ emctl status agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running

Note how emctl acknowledges that the agent is down now. I successfully secured and started the agent:

$ emctl secure agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already stopped...   Done.
Securing agent...   Started.
Enter Agent Registration Password :
Securing agent...   Successful.

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running
$ emctl start agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Starting agent .............. started.

One smaller problem remained:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:00:13
Last successful upload                       : 2011-02-14 10:00:19
Total Megabytes of XML files uploaded so far :    11.56
Number of XML files pending upload           :      188
Size of XML files pending upload(MB)         :    65.89
Available disk space on upload filesystem    :    60.11%
Collection Status                            : Disabled by Upload Manager
Last successful heartbeat to OMS             : 2011-02-14 10:00:17
---------------------------------------------------------------
Agent is Running and Ready

Note the “Disabled by Upload Manager” collection status in the output above. That’s because a lot of stuff hasn’t been transferred yet. Let’s force an upload. I know the communication between agent and OMS is working, so that should resolve the issue.

$ emctl upload
$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:02:12
Last successful upload                       : 2011-02-14 10:02:53
Total Megabytes of XML files uploaded so far :    91.12
Number of XML files pending upload           :       22
Size of XML files pending upload(MB)         :     1.50
Available disk space on upload filesystem    :    60.30%
Last successful heartbeat to OMS             : 2011-02-14 10:02:19
---------------------------------------------------------------
Agent is Running and Ready

That’s about it: a few minutes later the agent was visible on the console. Now this only had to be repeated for the remaining 6 nodes…
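Since the same check has to be repeated on every node, it may help to boil the status output down to the three values that matter here. The following is only a sketch: the function name agent_upload_summary is made up for illustration, and the field labels are copied from the 11.1.0.1.0 output shown above, so other agent releases may word them differently.

```shell
#!/bin/sh
# Sketch: summarise the upload backlog from `emctl status agent` output on stdin.
# agent_upload_summary is a hypothetical helper, not part of emctl.
# Usage (on an agent host): emctl status agent | agent_upload_summary
agent_upload_summary() {
    awk -F' *: *' '
        /^Number of XML files pending upload/ { files = $2 }
        /^Size of XML files pending upload/   { size = $2 }
        /^Collection Status/                  { status = $2 }
        END {
            # The Collection Status line is absent once collection is enabled again
            if (status == "") status = "not reported"
            printf "pending_files=%s pending_mb=%s collection_status=%s\n", files, size, status
        }'
}
```

If the Collection Status line is missing, as in the second status listing above, the helper reports "not reported" instead of guessing a value.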

NB: For the reasons shown in this article I don’t endorse duplicating host information in /etc/hosts and DNS. A resilient DNS infrastructure should always be used to store this kind of information.

Using a Windows Authenticated Proxy Server with EM11g

Over the years I have had a number of issues with proxy configuration in Enterprise Manager Grid Control. The product is very definitely improving in this regard, however EM 11g still does not integrate correctly with enterprise proxy servers that automatically authenticate users using NTLM – a fairly common configuration given the ubiquity of Active Directory in the marketplace [...]

Patch 10270073 – 11.1.0.1.2 Patch Set Update for Oracle Management Service

Today is patch day – on my current site we are having quite a few problems with our management server (OMS from now on). It occasionally simply “hangs” and doesn’t respond when connecting to the SSH port. A few minutes later it’s back to normal, but the behaviour is not reproducible.

So in an effort to please Oracle support, who couldn’t find a reason for this, I decided to apply Patch Set Update 2 to the OMS to get it to 11.1.0.1.2. I would also like to roll out the corresponding agent patch to our 11.1 agents. The PSU was released 2 weeks ago, so it’s reasonably fresh.

The patches for OMS and agent are generic, which is nice as it implies they are available for every platform. Our OMS had had problems before, and one-off patches had been applied. So the first step, as always with PSUs, is to check for conflicts. The readme file has the required instructions. I unzipped p10270073_111010_Generic.zip in /tmp/ and then ran the prerequisite checker from /tmp as shown in this example:

[oms]oracle@oms $ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir ./10270073
Invoking OPatch 11.1.0.8.0

Oracle Interim Patch Installer version 11.1.0.8.0
Copyright (c) 2009, Oracle Corporation.  All rights reserved.

PREREQ session

Oracle Home       : /u01/app/oracle/product/middleware/oms11g
Central Inventory : /u01/app/oracle/product/oraInventory
 from           : /var/opt/oracle/oraInst.loc
OPatch version    : 11.1.0.8.0
OUI version       : 11.1.0.8.0
OUI location      : /u01/app/oracle/product/middleware/oms11g/oui
Log file location : /u01/app/oracle/product/middleware/oms11g/cfgtoollogs/opatch/opatch2011-02-01_08-48-14AM.log

Patch history file: /u01/app/oracle/product/middleware/oms11g/cfgtoollogs/opatch/opatch_history.txt

OPatch detects the Middleware Home as "/u01/app/oracle/product/middleware"

Invoking prereq "checkconflictagainstohwithdetail"

ZOP-40: The patch(es) has conflicts/supersets with other patches installed in the Oracle Home (or) among themselves.

Prereq "checkConflictAgainstOHWithDetail" failed.

Summary of Conflict Analysis:

Patches that can be applied now without any conflicts are :
10270073

Following patches are not required, as they are subset of the patches in Oracle Home or subset of the patches in the given list :
9563902, 9537948, 9489355

Following patches will be rolled back from Oracle Home on application of the patches in the given list :
9563902, 9537948, 9489355

Conflicts/Supersets for each patch are:

Patch : 10270073

 Bug Superset of 9563902
 Super set bugs are:
 9491872,  9476313,  9544428

 Bug Superset of 9537948
 Super set bugs are:
 9537948

 Bug Superset of 9489355
 Super set bugs are:
 9489355

OPatch succeeded.
[oms]oracle@oms $
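
It can be useful to keep a record of the one-off patches that the PSU will roll back, in case one of them has to be revisited later. Here is a minimal sketch that pulls the list out of saved prerequisite-check output. The function name rollback_candidates is hypothetical, and the header text it looks for is taken from the OPatch 11.1.0.8.0 session above, so other OPatch versions may need a different pattern.

```shell
#!/bin/sh
# Sketch: print, one per line, the patch IDs that OPatch says it will roll back.
# rollback_candidates is a hypothetical helper; it reads saved prereq output on stdin.
# Usage: rollback_candidates < prereq_output.txt
rollback_candidates() {
    awk '
        /Following patches will be rolled back/ { grab = 1; next }
        grab && NF {
            # The IDs follow on the next non-empty line, separated by commas
            gsub(/,/, " ")
            n = split($0, ids, " ")
            for (i = 1; i <= n; i++) print ids[i]
            exit
        }'
}
```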

OK, so a few one-offs will be rolled back. Let’s get started. First of all we have to stop the OMS as shown here:

[oms]oracle@oms $ emctl stop oms
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Stopping WebTier...
WebTier Successfully Stopped
Stopping Oracle Management Server...
Oracle Management Server Successfully Stopped
Oracle Management Server is Down

The next step is to apply the patch. Here’s the sample session:

[oms]oracle@oms $ cd 10270073

[oms]oracle@oms $ $ORACLE_HOME/OPatch/opatch apply
Invoking OPatch 11.1.0.8.0

Oracle Interim Patch Installer version 11.1.0.8.0
Copyright (c) 2009, Oracle Corporation.  All rights reserved.

Oracle Home       : /u01/app/oracle/product/middleware/oms11g
Central Inventory : /u01/app/oracle/product/oraInventory
 from           : /var/opt/oracle/oraInst.loc
OPatch version    : 11.1.0.8.0
OUI version       : 11.1.0.8.0
OUI location      : /u01/app/oracle/product/middleware/oms11g/oui
Log file location : /u01/app/oracle/product/middleware/oms11g/cfgtoollogs/opatch/opatch2011-02-01_09-00-00AM.log

Patch history file: /u01/app/oracle/product/middleware/oms11g/cfgtoollogs/opatch/opatch_history.txt

OPatch detects the Middleware Home as "/u01/app/oracle/product/middleware"

ApplySession applying interim patch '10270073' to OH '/u01/app/oracle/product/middleware/oms11g'
Interim patch 10270073 is a superset of the patch(es) [  9563902 9537948 9489355 ] in the Oracle Home
OPatch will rollback the subset patches and apply the given patch.
Execution of 'sh /tmp/10270073/custom/scripts/init -apply 10270073 ':

Return Code = 0

Running prerequisite checks...

OPatch detected non-cluster Oracle Home from the inventory and will patch the local system only.

Backing up files and inventory (not for auto-rollback) for the Oracle Home
Backing up files affected by the patch '10270073' for restore. This might take a while...
Backing up files affected by the patch '9563902' for restore. This might take a while...
Backing up files affected by the patch '9537948' for restore. This might take a while...
Backing up files affected by the patch '9489355' for restore. This might take a while...
ApplySession rolling back interim patch '9563902' from OH '/u01/app/oracle/product/middleware/oms11g'

Patching component oracle.sysman.oms.core, 11.1.0.1.0...
Copying file to "/u01/app/oracle/product/middleware/oms11g/sysman/omsca/scripts/wls/create_domain.py"
RollbackSession removing interim patch '9563902' from inventory
ApplySession rolling back interim patch '9537948' from OH '/u01/app/oracle/product/middleware/oms11g'

Patching component oracle.sysman.oms.core, 11.1.0.1.0...
Updating jar file "/u01/app/oracle/product/middleware/oms11g/sysman/jlib/emInstall.jar" with "/u01/app/oracle/product/middleware/oms11g/.patch_storage/9537948_Apr_12_2010_03_37_52/files//sysman/jlib/emInstall.jar/oracle/sysman/configassistant/addon/AddOnConfigAssistantDriver.class"
Updating jar file "/u01/app/oracle/product/middleware/oms11g/sysman/jlib/emInstall.jar" with "/u01/app/oracle/product/middleware/oms11g/.patch_storage/9537948_Apr_12_2010_03_37_52/files//sysman/jlib/emInstall.jar/oracle/sysman/configassistant/addon/AddOnConfigAssistantDriver$1.class"
Updating jar file "/u01/app/oracle/product/middleware/oms11g/sysman/jlib/emInstall.jar" with "/u01/app/oracle/product/middleware/oms11g/.patch_storage/9537948_Apr_12_2010_03_37_52/files//sysman/jlib/emInstall.jar/oracle/sysman/configassistant/addon/AddOnConfigAssistantDriver$AddOnFileFilter.class"
RollbackSession removing interim patch '9537948' from inventory
ApplySession rolling back interim patch '9489355' from OH '/u01/app/oracle/product/middleware/oms11g'

Patching component oracle.sysman.oms.core, 11.1.0.1.0...
Copying file to "/u01/app/oracle/product/middleware/oms11g/bin/HAConfigCmds.pm"
RollbackSession removing interim patch '9489355' from inventory

OPatch back to application of the patch '10270073' after auto-rollback.

Backing up files affected by the patch '10270073' for rollback. This might take a while...

Patching component oracle.sysman.oms.core, 11.1.0.1.0...
Updating jar file "/u01/app/oracle/product/middleware/oms11g/sysman/jlib/emCORE.jar" with "/sysman/jlib/emCORE.jar/oracle/sysman/eml/ecm/policy/PolicyViolationsController.class"
[...]
Copying file to "/u01/app/oracle/product/middleware/oms11g/bin/SecureOMSCmds.pm"
Copying file to "/u01/app/oracle/product/middleware/oms11g/sysman/emdrep/scripts/SecureAgent_oms.pl"
ApplySession adding interim patch '10270073' to inventory

Verifying the update...
Inventory check OK: Patch ID 10270073 is registered in Oracle Home inventory with proper meta-data.
Files check OK: Files from Patch ID 10270073 are present in Oracle Home.
Execution of 'sh /tmp/10270073/custom/scripts/post -apply 10270073 ':

Return Code = 0
--------------------------------------------------------------------------------
The following warnings have occurred during OPatch execution:
1) OUI-67620:Interim patch 10270073 is a superset of the patch(es) [  9563902 9537948 9489355 ] in the Oracle Home
--------------------------------------------------------------------------------
OPatch Session completed with warnings.

OPatch completed with warnings.
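
“Completed with warnings” deserves a second look, and the warning summary can be checked mechanically. The sketch below lists the warning codes from saved OPatch output so that anything other than the expected superset warning OUI-67620 stands out. The function name opatch_warning_codes is hypothetical, and the numbered "1) OUI-nnnnn:" layout is assumed from the session above.

```shell
#!/bin/sh
# Sketch: list the OUI warning codes from the numbered warning summary of an
# OPatch session read on stdin. opatch_warning_codes is a hypothetical helper.
# Usage: opatch_warning_codes < opatch_session.txt | sort -u
opatch_warning_codes() {
    # Matches lines such as "1) OUI-67620:Interim patch ..." and keeps only the code
    sed -n 's/^[0-9][0-9]*) \(OUI-[0-9][0-9]*\):.*/\1/p'
}
```

In this session the only code reported would be OUI-67620, the superset warning the prerequisite check had already announced.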

This process took 1 hour on my SPARC zone, which I’m not too impressed with. But I’m blaming the overloaded box rather than the patch process.

The outcome of the patching looks OK to me: I already knew PSU 2 would be applied as a superset of 3 other patches. The next step is to run the post-install script. This is done by calling a JDBC application as shown here:

[oms]oracle@oms $ $ORACLE_HOME/bin/rcuJDBCEngine sys/secretpassword@repositoryHost:1821:REPOSDB JDBC_SCRIPT post_install_script.sql
Extracting Statement from File Name: 'post_install_script.sql' Line Number: 1
Extracted SQL Statement: [alter session set current_schema=SYSMAN]
 Statement Type: 'DDL Statement'
Executing SQL statement: alter session set current_schema=SYSMAN
Extracting Statement from File Name: 'post_install_script.sql' Line Number: 1
Extracting Statement from File Name: 'post_install_script.sql' Line Number: 1
Extracted SQL Statement: [SET serveroutput on size 1000000]
Skipping Unsupported Statement
 Statement Type: 'Oracle RCU NotSupported SQLPlus Statement'
Extracting Statement from File Name: 'post_install_script.sql' Line Number: 2
Extracted SQL Statement: [BEGIN
 EXECUTE IMMEDIATE 'drop table bundle_component_files';
 EXCEPTION WHEN OTHERS THEN
 IF sqlcode = -942 THEN
 NULL;
 ELSE
 RAISE;
 END IF;
END;
]
....
 END;
 END IF;
 END IF;
 RAISE;
END;
]
 Statement Type: 'BEGIN/END Anonymous Block'
Completed SQL script execution normally.
1 scripts were processed
[oms]oracle@oms $

Now finally it’s time to start the OMS and cross our fingers to see if it worked:

[oms]oracle@oms $ emctl start oms
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Starting WebTier...
WebTier Successfully Started
Starting Oracle Management Server...
Oracle Management Server Successfully Started
Oracle Management Server is Up
[oms]oracle@oms $

I’d call this a success: I managed to log on to the OMS and performed some basic testing, which suggested the patch had been applied successfully.
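
For a scripted check to go with the manual login, one option is to poll a status command until it reports the OMS as up. This is a sketch under stated assumptions: wait_for_oms is a hypothetical helper, and it assumes that `emctl status oms` prints the same “Oracle Management Server is Up” line seen in the start-up output above. It relies on POSIX shell arithmetic, so on older systems it needs ksh or bash rather than the legacy Bourne shell.

```shell
#!/bin/sh
# Sketch: run a status command repeatedly until it reports the OMS as up.
# wait_for_oms is a hypothetical helper; the "is Up" wording is assumed to match
# the emctl start output shown earlier.
# Usage: wait_for_oms TRIES DELAY_SECONDS COMMAND [ARGS...]
#   e.g. wait_for_oms 10 30 emctl status oms
wait_for_oms() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" 2>/dev/null | grep 'Oracle Management Server is Up' >/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

The status command is passed in as arguments rather than hard-coded, so the same loop can be tried against a stub before pointing it at a real OMS.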