Search

Top 60 Oracle Blogs

Recent comments

December 2009

user centered design

It's the day before the new year and I have two topics related to development that I am hoping to complete today to wrap up 2009. Then I plan to refocus this blog on its originally intended topic - measurement. I can't promise not to wander off topic again but I will do my best.A few weeks ago, Cary Millsap sent me a link to The Fable of the User-Centered Designer. It's short and an enjoyable

50+ SQL Performance Optimization scenarios

Before the year ends I’d like to share some good stuff…

I have never seen a huge compilation of SQL tuning tips or rewrite scenarios (with test cases) and got them only on OracleFans forum… ooops… so you can’t read Chinese? try this translated version, whew.. good thing Google has this translate service that I am able to read in Chinese.. </p />
</p></div>

    	  	<div class=

Oracle Open World 2009 Report - Part Two


Tuesday October 13th

Unconference on Indexes
Richard Foote
10/13/2009 10:00 AM

I started off the day attending the indexing presentation of fellow Oak Table member Richard Foote.  Foote has become quite well known for his expertise on index internals since the publication of Oracle B-Tree Index Internals: Rebuilding the Truth

This was basically a Q&A session, and I will include just a couple of the questions.

Q: Have you ever seen an index Skip Scan used correctly?
A: The short answer was 'No'

Instance_Number is busy Message during Standby Instance Startup

Recently I encountered a situation that demonstrates how, in an Oracle database, an error may stem from a very unrelated cause. A DBA was building a physical standby database for an upcominmg training. The two servers he was using were part of a RAC cluster; so the Oracle binaries were already there. He decided to use the same ORACLE_HOME for the new databases as well - a quite logical decision to savbe on space and administration issues. He created a primary database on server n1 and a standby database on the server n2. Follooing the usual manual procedure in building the standby database, he copied the pfile from primary database, modified the parameters and brought up the standby instance in nomount mode on the server n2.

SQL> startup nomount pfile=initSTBY.ora

But it refused to come up, with the following error:

ORA-00304: requested INSTANCE_NUMBER is busy

Alert log showed:

USER (ospid: 14210): terminating the instance due to error 304
Instance terminated by USER, pid = 14210

This was highly unusual. The primary and standby both were non-RAC; there was no instance number concept in a non-RAC database. By the way, the RAC instance on that server (or on the server n1) was not running; so there was no question of any conflict with the RAC instances either. The primary database was called PRIM while the standby was called STBY - eliminating the possibility of an instance name clash as well. And this error came while merely trying to start the instance, not even while mounting - eliminating the standby controlfile as a cause.

The error 304 showed:

00304, 00000, "requested INSTANCE_NUMBER is busy"
// *Cause: An instance tried to start by using a value of the
// initialization parameter INSTANCE_NUMBER that is already in use.
// *Action: Either
// a) specify another INSTANCE_NUMBER,
// b) shut down the running instance with this number
// c) wait for instance recovery to complete on the instance with
// this number.

Needless to say, being for a non-RAC database there was no "instance_number" parameter in the initialization parameter file of primary or the standby. So, the suggestions for the resolution seemed odd. MetaLink provided no help. All the ORA-304 errors were related to RAC with the instance_number mismatch.

As it always happens, it fell on my lap at this time. With just days to go live, I had to find a solution quickly. Long hours of troubleshooting, tracing the processes and examination of occasional trace files did not yield any clue. All the clue seemed to point to RAC, which this database was not. The Oracle Home was a RAC home, which meant the oracle binary was linked with the "rac" option.

So, the next logical step was to install a new Oracle Home without the rac option. After doing so, I tried to bring up the instance, using the new ORACLE_HOME and LD_LIBRARY_PATH variable; but, alas, the same error popped up.

Just to eliminate the possibility of some unknown bug, I decided to put an instance_number parameter, setting it to "1", from the default "0". The same error. I changed it to "2", again, the result was the same error.

Although this didn't help, at least it gave a clue that the error was not related to instance_number. The error message was clearly wrong. With this in mind, I went back to the basics. I went through the alert log with a fine toothed comb, scanning and analyzing each line.

The following line drew my attention:

DB_UNQIUE_NAME STBY is not in the Data Guard configuration

This was odd; the db_unique_name STBY was not defined in DG configuration. [BTW, note the spelling of "unique" in the message above. That is not what I typed; it was a copy and paste from the actual message in the alert log. Someone in Oracle Development should really pay atytention to typos in messages. This is clearly more than a nuisance; what if some process scans for db_unique_name related errors? It will not find the message at all!]

Checking the dg configuration, I found that the DBA has correctly defined the primary and standby names. In any case, Data Guard has not been started yet; this is merely at the instance startup - why is it complaining for data guard configuration at this time.

Perplexed, I resorted to a different approach. I renamed the pfile and all other relevant files. Then I built the standby myself, from scratch - using the same names - PRIM and STBY. And this time, everything worked fine. The instance STBY did come up.

While this solvbed the urgency problem, everyone, inclduing myself, wanted to know what the issue was in the earlier case where the DBA had failed to bring up the instance. To get the answer, I compared the files I created with the DBA created when tried and failed. Voila! The cause was immediately clear - the DBA forgot to put a vital parameter in the pfile of the standby instance:

db_unique_name = 'STBY'

This parameter was absent; so it took the default value as the db_name, which was "PRIM". This caused the instance to fail with a seemingly unrelated message - "ORA-304 Instance_number is busy"!

Learning Points

  1. In Oracle, most errors are obvious; but some are not. So, do not assume the error message is accurate. If all logic fails, assume the error messsage is wrong, or at least inaccurate.
  2. Instead of over-analyzing the process already followed, it may make sense to take a breather, wipe out everything and start fropm scratch. This is evben mor effective when someone else does it, offering a fresh approach and possibly not repeating the same mistakes.
  3. Finally, the issue at hand: if you do not define db_unique_name parameter in the standnby instance, you will receive ORA-304 during instance startup.

Hope this was helpful. Happy New Year, everybody.

The “Do What You Love” Mirage

I am inspired by having read an article called “Do what you love mirage” by Denis Basaric. It begins...

“Do what you love” is advice I hear exclusively from financially secure people. And it rings hollow to me. When you need money to survive, you do any work that is available, love does not play into that choice. Desperation does.

Please read it before you go on.

Welcome back.

This article puts a very important cycle within my life into words. I believe, as Denis says, that a lot of times, we get the cause-effect relationship mixed up when we think about loving what we do.

I love what I do. Well, a lot of it. But Denis is right: I didn’t choose what I do out of love. I chose what I love out of doing. Some examples:

Hinting Sub-Queries on Oracle

This posting is a purely Oracle RDBMS discussion about how to correctly apply hints to sub-queries. However, it is particularly relevant to PeopleSoft, which it makes extensive use of correlated sub-queries. The point of this story is not the particular hints that I applied, but where I placed the hints, and how I scoped them to the sub-queries.

The following SQL extract is from the delivered Global Payroll GPGB_EDI process (although I have already made some minor changes, and PS_GP_RSLT_PIN is partitioned). Notice that each sub-query joins two tables together. PS_GP_RSLT_PIN is second largest of Global Payroll result tables. PS_GP_PIN is a look-up table, and the criterion on PIN_CODE will only return a single row.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
UPDATE Table(GPGB_EDIE_TMP) X

Hinting Sub-Queries on Oracle

This posting is a purely Oracle RDBMS discussion about how to correctly apply hints to sub-queries. However, it is particularly relevant to PeopleSoft, which it makes extensive use of correlated sub-queries. The point of this story is not the particular hints that I applied, but where I placed the hints, and how I scoped them to the sub-queries.

The following SQL extract is from the delivered Global Payroll GPGB_EDI process (although I have already made some minor changes, and PS_GP_RSLT_PIN is partitioned). Notice that each sub-query joins two tables together. PS_GP_RSLT_PIN is second largest of Global Payroll result tables. PS_GP_PIN is a look-up table, and the criterion on PIN_CODE will only return a single row.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
UPDATE Table(GPGB_EDIE_TMP) X

RAC performance tuning: Understanding Global cache performance

Global cache performance metrics are not correctly measured. It is not understood clearly either. There are even few blogs and web pages disseminating incorrect information. This blog entry is an attempt to offer few methods and scripts to understand global cache performance.

Always review all instances

It is very important to review the performance metrics from all instances in that RAC cluster, not just one instance that you are connected. If you have access to AWR reports, then it is critical to generate AWR reports (or statspack reports) from all instances. But, the problem is that, DBAs tend to generate AWR reports after logging in to each instance iteratively, enter couple of parameters and then reports are generated. Not exactly a convenient practice.

  REM connect to each instance separately, type in the beginning snap_id and ending snap_id for each node etc..
   sqlplus mydba@proddb1
   @$ORACLE_HOME/rdbms/admin/awrrpt.sql
   exit;
   sqlplus mydba@proddb2
   @$ORACLE_HOME/rdbms/admin/awrrpt.sql
   exit;
   sqlplus mydba@proddb3
   @$ORACLE_HOME/rdbms/admin/awrrpt.sql
   exit;

There are few issues with this approach. It is a cumbersome practice if the instance count is higher. In addition to that, all of AWR reports are, in turn, accessing underlying AWR tables. Physically, rows from all instances are together in the same block and so, by executing these reports connecting to various instances, Global cache traffic is increased. If the database is suffering from Global cache (GC) performance issues then generating reports connecting to various instances is probably not a grand idea.

I have created few basic scripts to generate AWR reports from all instances in a single step. Script awrrpt_all_gen.sql queries AWR tables and gv$instance to access last snap_id and various instance_ids. Then this script generates AWR reports for all instances in that cluster in the current directory, for the last AWR snap. This script is quite handy while debugging RAC performance issues. Now, we have AWR reports from all instances in one directory and we can easily grep these files to understand metrics in various instances.

@awrrpt_all_gen.sql

awrrpt_all_gen.sql v1.00 by Riyaj Shamsudeen @orainternals.com

To generate AWR Report from all RAC instances concurrently.
Creates reports using last two snap_ids.

...Generating awrrpt_all.sql script.... Please wait....
...Completed script generation.

Executing awrrpt_all.sql to generate AWR reports.
...Generates AWR reports with file name format awrrpt___.txt for each instance.
...Please wait for few minutes...

...AWR report created for instance 1. Please wait..
...AWR report created for instance 2. Please wait..
...AWR report created for instance 3. Please wait..
...AWR report created for instance 4. Please wait..

AWR reports created.

If you would like ability to choose the snap_ids, then use the script awrrpt_all_range_gen.sql

Use granular approach to GC measurements

It is important to measure Global cache performance metrics correctly, especially those damned averages. Incorrect measurements of averages can lead to faulty analysis. For example, following query is retrieving average global cache CR receive time from gv$sysstat view. Problem with this SQL statement is that this statement is retrieving average global cache receive time from the restart of the instances. Global cache performance is quite sensitive to workload and querying over a wide time range, as in this query, can lead to faulty analysis.


REM This is really a bad idea!!

select b1.inst_id, b2.value "RECEIVED",
b1.value "RECEIVE TIME",
((b1.value / b2.value) * 10) "AVG RECEIVE TIME (ms)"
from gv$sysstat b1, gv$sysstat b2
where b1.name = 'gc current block receive time' and
b2.name = 'gc current blocks received' and b1.inst_id = b2.inst_id;

As an analogy, above query is equivalent to a patient-Doctor conversation:

Patient: Doctor, I don’t feel good. I feel that I am having fever.
Doctor : Nope. Your average body temperature from the beginning of this year is around 98. So, you are doing excellent.

It is essential to review the global cache performance metrics at a granular level. Use AWR reports or statspack reports for a short period of time. Better yet, if you want to quickly review GC performance metrics, you can use my script gc_traffic_print.sql. This script captures various statistics relevant to Global cache performance in to a PL/SQL table, sleeps for N seconds (default N=60) and re-captures the statistics after the sleep. Then the script prints all the relevant details in easily readable tabular format. Very useful in quickly reviewing the Global cache performance metrics.

SQL> @gc_traffic_print.sql

gc_traffic_print.sql v1.20 by Riyaj Shamsudeen @orainternals.com

...Calculating GC Rx and Tx timing and blocks..
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...
Enter value for sleep: 30
---------|--------------|---------|----------------|----------|---------------|---------------|-------------|
Inst     | CR blocks Rx | CR time |  CUR blocks Rx | CUR time |  CR blocks Tx | CUR blocks Tx |Tot blocks   |
---------|--------------|---------|----------------|----------|---------------|---------------|-------------|
1        |         29288|    12.17|          14914|      4.58|          27059|           17439|        88700|
2        |         21201|     5.75|          12835|      5.19|          26831|           16741|        77608|
3        |         22942|     4.64|          11751|      3.82|          23918|            9205|        67816|
4        |         32075|     4.78|          19260|      4.55|          31046|           19331|       101712|
---------|--------------|---------|----------------|----------|---------------|---------------|-------------|

PL/SQL procedure successfully completed.

Column key for the above script is:
CR block RX : GC CR blocks received
CR time : Average GC CR receive time
CUR blocks RX: GC CUR blocks received
CUR time: Average GC CuR receive time
CR blocks TX: GC CR blocks transmitted
CUR blocks TX: GC CUR blocks transmitted
tot blocks: Sum of transmitted + received for both CR and CUR traffic

Breakup: global cache performance

Global cache processing performance can be broken down to its sub components ( excluding wire latency, but that can be calculated). Breakup of this metrics must be done at a granular level also. We can use yet another script gc_traffic_processing.sql to review the performance metrics for the past N seconds. For example, in the output printed below, Global cache CR processing time is broken in to various sub-components of global cache processing. Remember that these performance metrics are in the SEND side of Global cache performance metrics. Meaning, inefficient global cache processing in one instance affects other instances' global cache receive time, but not the self. For example, in the table above, GC CR receive performance was at 12.1ms for instance 1. That is due to another instance (probably node 4 in this case) suffering from high GC processing time (more specifically high 'CR block flush time' in node 4). It is far more accurate to review global cache performance issues in a granular basis. Caution though, try not to specify at least a minute so that these numbers are consistent.

@gc_traffic_procssing.sql
gc_traffic_processing.sql v1.00 by Riyaj Shamsudeen @orainternals.com

...Prints various timing related information for the past N seconds
...Default collection period is 60 seconds.... Please wait for at least 60 seconds...
Enter value for sleep:

---------|-----------|---------|-----------|----------|------------|------------|------------|----------|
Inst     | CR blk Tx | CR bld  | CR fls tm | CR snd tm| CUR blk TX | CUR pin tm | CUR fls tm |CUR snd tm|
---------|-----------|---------|-----------|----------|------------|------------|------------|----------|
1        |      62732|       .2|       1.25|       .21|       48920|        .96|         .22|        .22|
2        |      46037|      .13|        .76|        .2|       42019|        .68|         .17|         .2|
3        |      72237|      .63|        .79|       .19|       22697|        .44|         .29|        .23|
4        |     113816|      .13|       2.04|       .15|       51020|        .87|         .12|        .13|
--------------------------------------------------------------------------------------------------------

PL/SQL procedure successfully completed.

Here is the column definitions for the script output:
CR blk TX: GC CR blocks served
CR bld : Average GC CR build time/CR block served
CR fls tm: Average GC CR flush time/CR block served
CR snd tm: Average GC CR send time/CR block served
CUR blk TX: GC CUR blocks served
CUR pin tm: Average GC Current pin time /CUR block served
CUR fls tm : Average GC Current flush time/CUR block served
CUR snd tm: Average GC current send time/CUR block served

Review histograms for GC events

Averages can be misleading. Revealing performance issues underneath the average is a very important task to do. For example, we see that 'gc cr block receive time' average is high. But, let's review the histogram for the gc events related to Global cache receive time. My script gc_event_histogram.sql is quite handy to review event histograms. This script provides a difference in performance statistics between two samples and prints them in a readable format. In this case, node 1 is has 8578 waits for the event 'gc cr block 2-way' under 2ms and the sessions in instance 1 had 9851 waits between 2-4ms. Only few waits over 16ms.

@gc_event_histogram.sql

gc_event_histogram.sql v1.0 by Riyaj Shamsudeen @orainternals.com

Default collection period is sleep seconds. Please wait..
Enter value for event: gc cr block 2-way
Enter value for sleep: 60
---------|-----------------------|----------------|----------|
Inst id  | Event                 |wait time milli |wait cnt  |
---------|-----------------------|----------------|----------|
1        |gc cr block 2-way      |               1|        98|
1        |gc cr block 2-way      |               2|      8578|
1        |gc cr block 2-way      |               4|      9851|
1        |gc cr block 2-way      |               8|      2291|
1        |gc cr block 2-way      |              16|       117|
1        |gc cr block 2-way      |              32|         6|
1        |gc cr block 2-way      |              64|         1|
1        |gc cr block 2-way      |             128|         2|
1        |gc cr block 2-way      |             256|         0|
1        |gc cr block 2-way      |             512|         0|
1        |gc cr block 2-way      |            1024|         0|
1        |gc cr block 2-way      |            2048|         0|
1        |gc cr block 2-way      |            4096|         0|
1        |gc cr block 2-way      |            8192|         0|
1        |gc cr block 2-way      |           16384|         0|
2        |gc cr block 2-way      |               1|       155|
2        |gc cr block 2-way      |               2|     10792|
2        |gc cr block 2-way      |               4|     14201|
2        |gc cr block 2-way      |               8|      3887|
2        |gc cr block 2-way      |              16|       204|
...

But, we also need to see 3-way waits. Let's review 3-way waits with the above script. We will simply supply 'gc cr block 3-way' as the wait event. In this case, we see that there are 12,062 waits over between 4-8ms. Of course, this must be improved.


@gc_event_histogram.sql

gc_event_histogram.sql v1.0 by Riyaj Shamsudeen @orainternals.com

Default collection period is sleep seconds. Please wait..
Enter value for event: gc cr block 3-way
Enter value for sleep: 60
---------|-----------------------|----------------|----------|
Inst id  | Event                 |wait time milli |wait cnt  |
---------|-----------------------|----------------|----------|
1        |gc cr block 3-way      |               1|         3|
1        |gc cr block 3-way      |               2|       320|
1        |gc cr block 3-way      |               4|     17017|
1        |gc cr block 3-way      |               8|     12062|
1        |gc cr block 3-way      |              16|      1725|
1        |gc cr block 3-way      |              32|       113|
1        |gc cr block 3-way      |              64|         0|
1        |gc cr block 3-way      |             128|         0|
1        |gc cr block 3-way      |             256|         0|
1        |gc cr block 3-way      |             512|         0|
1        |gc cr block 3-way      |            1024|         0|
1        |gc cr block 3-way      |            2048|         0|
1        |gc cr block 3-way      |            4096|         0|
1        |gc cr block 3-way      |            8192|         0|
...

In essence, these four scripts will be useful in debugging Global cache performance issues. Please let me know if you encounter errors in these scripts. We will use these scripts in future discussions to show how we resolved few RAC performance issues.

Merry Christmas (Peace On Earth)

Wow, yet another year almost behind us and what looks like being yet another busy year ahead. Can you believe that The Noughties are nearly all gone, I remember being all excited at the dawn of The Eighties !! I must say I’m really looking forward to finally saying “Twenty Ten” rather than the more long-winded “Two [...]

The Core Performance Fundamentals Of Oracle Data Warehousing – Balanced Hardware Configuration

[back to Introduction] If you want to build a house that will stand the test of time, you need to build on a solid foundation. The same goes for architecting computer systems that run databases. If the underlying hardware is not sized appropriately it will likely lead to people blaming software. All too often I [...]