Search

OakieTags

Who's online

There are currently 0 users and 10 guests online.

Recent comments

Affiliations

performance

10046 Extended SQL Trace Interpretation 3

September 6, 2010 (Back to the Previous Post in the Series) In previous blog articles in this series we looked at various methods for starting a 10046 trace file and also the basics of walking through a trace file.  Today’s blog article continues the investigation of 10046 trace files.  To begin, we need a test [...]

My OTN Interview at OOW2010 (which hasn’t happened yet)

Yesterday, Justin Kestelyn from Oracle Technology Network sent me a note specifying some logistics for our OTN interview together called “Coding for Performance, with Cary Millsap.” It will take place at Oracle OpenWorld on Monday, September 20 at 11:45am Pacific Time.

One of Justin’s requests in his note was, “Topics: Please suggest 5 topics for discussion?” So, I thought for a couple of minutes about questions I’d like him to ask me, and I wrote down the first one. And then I thought to myself that I might as well write down the answer I would hope to say to this; maybe it’ll help me remember everything I want to say. Then I wrote another question, and the answer just flowed, and then another, and another. Fifteen minutes later, I had the whole thing written out.

I told Justin all this, and we agreed that it would be fun to post the whole interview here on my blog, before it ever happened. And then during the actual interview, we’ll see what actually happens. It’ll all be in Justin’s hands by then.

So, here we go. Justin Kestelyn’s interview, “Coding for Performance, with Cary Millsap.” Which hasn’t happened yet.

◆ ◆ ◆

Justin: Hi, Cary. Welcome to the show, etc., etc.

Cary: Hi, Justin. It’s great to be here. Thank you for having me, etc., etc.

Justin: So tell me, ... What is the most important thing to know about performance?

Fair Comparison

From time to time someone will post a question about query performance on the OTN database forum asking why one form of a query returns data almost immediately while another form of the query takes minutes to return the data.

Obviously there are all sorts of reasons – the optimizer is not perfect, and different transformations may take place that really do result in a huge differences in work done by two queries which return the same result set – but a very simple reason that can easily be overlooked is the front-end tool being used to run the query.

Because of its popularity, Toad is a common “culprit”. You type in your query, you hit the execute key or icon, the query runs, and the bottom of the screen shows a nice spreadsheet representation of your result set – but not necessarily immediately.

It’s important to remember that when you do this, Toad (or whatever your favourite GUI is) may only be showing you the first few rows of the result set. If you see a massive difference in apparent performance between two versions of a query, it may be that Oracle has used two different strategies to get the entire data set. For example – imagine your query is supposed to return 5,000 rows -

Version A unnests an aggregate subquery, does complex view merging, and aggregates late. No data appears for three minutes and then the first 20 rows appear in the spreadsheet.

Version B runs the same aggregate subquery as a filter subquery, the first 20 rows appears instantly.

Quiz Night.

I was on a customer site recently where I needed to add a NOT NULL constraint to a table of 200 million rows – without taking any downtime. It’s not difficult (provided you are happy with a check constraint rather than a column definition.)

alter table t1
	add constraint t1_ck_colX_nn check (colX is not null)
	enable novalidate
;

The first step creates the constraint and enables it – but doesn’t validate it. This means that future data (and changes) will obey the constraint, but there may be illegal data already in the table that will not be checked. You will have an interruption to service doing this, as your session will wait to lock the table in share mode (mode 4) to add the constraint – so will be blocked by current update transactions, and will block new update transactions. In a typical OLTP system this should result in just a brief pause.

The second step validates the constraint, which needs a slightly more complex piece of code – perhaps something like the following:

declare
	resource_busy exception;
	pragma EXCEPTION_INIT(resource_busy, -54);
begin
	loop
		begin
			execute immediate
			'alter table t1 modify constraint t1_ck_colX_nn validate';
			dbms_output.put_line('Succeeded');
			exit;
		exception
			when resource_busy then
				dbms_output.put_line('Failed');
		end;
		dbms_lock.sleep(0.01);
	end loop;
end;
/

This code tries to validate the constraint and goes into a loop, sleeping for 1/100 second, if it hits the “resource busy” error. Unlike the call to add the constraint, the call to validate it doesn’t wait for a lock – it requests an immediate lock, so it’s going to fail immediately if there are any active transactions on the table. Once it has the lock it drops it, so you’re not going to shut your users out while the validation takes place.

Index rebuilds

A couple of years ago I wrote about a poster on the OTN db forum who was puzzled by the fact that when he started rebuilding tables they got bigger.  (Bad luck, sometimes that’s what happens !)

A few days ago a related question appeared: I rebuilt some indexes and my query got slower. (Bad luck, sometimes that’s what happens – again!)

If you rebuild an index it’s physically different and its statistics are different. Plans can change and go slower because the index stats look sufficiently different; plans can stay the same and go slower because the index is physically different. 

I’ve added a couple of comments to the thread – there may still be some further mileage in it.

SaneSAN2010: Serial to Serial – When One Bottleneck Isn’t Enough

I was recently looking into a storage-related performance problem at a customer site. The system was an Oracle 10.2.0.4/SLES 9 Linux system, Fibre Channel attached to an EMC DMX storage array. The DMX was replicated to a DR site using SRDF/S.

The problem was only really visible during the overnight batch runs, so AWR reports were the main source of information in diagnosis. In this case, they were more than sufficient, showing clear wait spikes for ‘free buffer waits’ and ‘log file parallel write’ during the problematic period. They were quite impressive, too – sixteen second latencies for some of the writes.

The customer was not oblivious to this fact, of course – it is difficult not to see a problem of such magnitude. They already had an engineering plan to move from SRDF/S (Synchronous) to SRDF/A (Asynchronous), as it was perceived that SRDF was the guilty party in this situation. I had been asked to validate this assumption and to determine the most appropriate roadmap for fixing these I/O problems on this highly critical system.

Of course, SRDF/S will always get the blame in such situations :) I have been involved with many such configurations, and can indeed attest that SRDF/S can lead to trouble, particularly if the implications are not understood correctly. In this case, a very large financial institution, the storage team did indeed understand the main implication (more on this shortly) but, as is often the case in large organisations, the main source of the problem was actually what I call a boundary issue, one which falls between or across the technology focus of two or more teams. In this case, there were three teams involved, leading to configuration issues across all three areas.

Sane SAN 2010 – Introduction

This year at the UKOUG Conference in Birmingham, acceptance permitting, I will present the successor to my original Sane SAN whitepaper first penned in 2000. The initial paper was spectacularly well received, relatively speaking, mostly because disk storage at that time was very much a black box to DBAs and a great deal of mystique surrounded its operation. Well, nothing much has changed on that front, so I figured it was very much time to update/rewrite the paper for modern technology and trends and try to impose my occasionally humble opinion on the reader :)

I’ve already an article on this blog that will form part of the paper, and I will write a few more over the next few weeks. Check out this one, and keep an eye on my blog for the next few weeks. The first one out the bag is “Serial to Serial – When One Bottleneck Isn’t Enough.”

If the UKOUG don’t accept it, I’ll post it on the Scale Abilities website anyway, and try to palm it off on some other conferences in 2011 :)

Cardinalilty One

I think anyone who has read Wolfgang Breitling’s material about the optimizer will be familiar with the concept of Cardinality Feedback and one particular detail that when Oracle gets a cardinality estimate of one for a “driving” table then there’s a good chance that the execution plan will go wrong. (That’s not rule, by the way, just a fairly common observation after things have gone wrong.)

A recent note on OTN reminded me of a particular scenario where this specific problem can occur. It’s not particularly common, but it may hit people who are building data warehouses from multiple different sources. We start with an unlikely looking data set and very simple query:

drop table t1;

create table t1 as
select
	rownum id1,
	rownum id2
from
	all_objects
where
	rownum <= 10000
;

execute dbms_stats.gather_table_stats(user,'t1');

set autotrace traceonly

select
	count(*)
from
	t1
where
	id1 = id2
;

What do you think Oracle estimated cardinality will be for this predciate ? We know, because we saw the data being built, that we’re going to identify 10,000 rows. But the optimizer doesn’t see it that way – check line 2 of the execution plan. The optimizer thinks it will find just one row:

Joins – MJ

The final join mechanism in my “all joins are nested loop joins” argument is the Merge Join – a join mechanism that depends on both its row sources being pre-sorted on the join columns. The note on hash joins pointed out that a “traditional” nested loop join may result in repeated visits to data blocks [...]

Buffer Cache Hit Ratio, Optimizer Index Cost Adj, Clustering Factor, Performance Testing – How to Break a Test Computer

August 14, 2010 My replacement laptop arrived this week to fix an occasional problem that I had with the old laptop: I thought that I would perform a test on the new laptop, so I thought that I would try a slightly modified version of the script that appeared in the “Expert Oracle Practices” book in the Buffer [...]