Search

OakieTags

Who's online

There are currently 0 users and 9 guests online.

Recent comments

Affiliations

Oakies Blog Aggregator

CBO Surprise

Well, it surprised me!

I’ve said for a very long time that in principle, ignoring bugs and restrictions, the optimizer will always choose the lowest cost option during its search for an execution path. It turns out that this isn’t true. In a comment attached to a note I had written about a possible bug relating to function-based indexes I was told that there are cases where the optimizer follows a rule that allows it to ignore the lowest cost path if it is derived from a range-based predicate involving unpeekable bind variables.

The trouble is, any statement made about “bind variables” may also apply in any circumstances where the optimizer sees: “unknown value”. Here’s a simplified example that I find a little worrying (running on 11.1):

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	rownum			id,
	mod(rownum-1,20)	scattered,
	trunc(rownum / 500)	clustered,
	lpad(mod(rownum,2),10)	ind_pad,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 40000
;

create index t1_equi  on t1(scattered, ind_pad, clustered);
create index t1_range on t1(clustered, scattered);

-- call dbms_stats to compute table stats with no histograms

set autotrace traceonly explain

select
	small_vc
from
	t1
where
	scattered = 10
and	clustered between (select 40 from dual)
	          and     (select 41 from dual)
;

select
	/*+ index(t1 t1_range) */
	small_vc
from
	t1
where
	scattered = 10
and	clustered between (select 40 from dual)
	          and     (select 41 from dual)
;

The SQL with its “select constant from dual” may look a little artificial – but it represents a strategy that is used quite commonly; and it’s just one of several code patterns that can kick the optimizer into using the arithmetic for “unknown value at compile time” (another would the case where you use sys_context(), possibly in fine-grained access control – a.k.a. row-level security or virtual private database).

I’ve got two possible indexes on the table that could be used to assist this query – and one of them is significantly larger than the other because it has an extra column (ind_pad) in the middle which does not appear in the where clause. which index is the optimizer going to use ?

-- Default path
-----------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost  |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     5 |    85 |    17 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |     5 |    85 |    13 |
|*  2 |   INDEX RANGE SCAN          | T1_EQUI |     9 |       |     9 |
|   3 |    FAST DUAL                |         |     1 |       |     2 |
|   4 |    FAST DUAL                |         |     1 |       |     2 |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("SCATTERED"=10 AND "CLUSTERED">= (SELECT 40 FROM
              "SYS"."DUAL" "DUAL") AND "CLUSTERED"<= (SELECT 41 FROM "SYS"."DUAL"
              "DUAL"))
       filter("CLUSTERED">= (SELECT 40 FROM "SYS"."DUAL" "DUAL") AND
              "CLUSTERED"<= (SELECT 41 FROM "SYS"."DUAL" "DUAL"))

-- Hinted to user t1_range index
------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost  |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     5 |    85 |    10 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1       |     5 |    85 |     6 |
|*  2 |   INDEX RANGE SCAN          | T1_RANGE |     9 |       |     2 |
|   3 |    FAST DUAL                |          |     1 |       |     2 |
|   4 |    FAST DUAL                |          |     1 |       |     2 |
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("CLUSTERED">= (SELECT 40 FROM "SYS"."DUAL" "DUAL") AND
              "SCATTERED"=10 AND "CLUSTERED"<= (SELECT 41 FROM "SYS"."DUAL" "DUAL"))
       filter("SCATTERED"=10)

The optimizer has selected the path with the higher cost – and it’s hard-coded to do this in response to the nature of the predicates and the available choice of indexes. We have two indexes, with predicate usage as follows:

t1_equi (scattered, ind_pad, clustered) with predicates that are: (equality, omitted, range)
t1_range (clustered, scattered) with predicates that are: (range, equality)

The important point is that we have a predicate on column clustered which is range-based with values that are unknown at optimisation time (even though the human eye can see that the values are going to be 40 and 41). Because of this uncertainty the optimizer is coded to bypass the index that starts with this suspect column and use the (higher cost) index where there is a known starting predicate. In effect the optimizer seems to be saying: “in the worst case scenario t1_range might end up doing an index full scan but t1_equi will only have to do a partial range scan”.

I’ve still got a lot of examples I want to work through to see how far ranging this rule is and if it ever gets ignored (what happens, for example, to joins with range-based predicates), but I hope that this preliminary note acts a useful clue when you next see Oracle ignoring the lowest cost path.

Open World

I won’t be attending Open World this year but if you’re going you may want to check out the Google Spreadsheet that Greg Rahn has prepared to list the presentations from Oak Table members.

(I’ll be deleting this post when OOW 2010 is over.)

IBM System x3650 M3 Express

IBMダイレクト価格 556,500円。。。エントリレベルの価格ではない。サーバレベルの価格です。
インテル Xeonプロセッサー5600番台搭載。最大8TBのHDD、大容量192GBメモリー搭載可能。

初期メモリは2GBx3で6GBしかないようだが、結構安い。
企業で使うサーバであれば、自作というわけにはいかないだろうから、安い。

でも16GBのメモリで、SSD6本のRAID0+HDD2本と10万円を切ったXeon X5650で合計35万円で自作は出来る。

この動画は日本向けだけなのだろうか?
さすが日本IBMです。

ORA_LPENABLEを1に設定

In-Memory Parallel Queryの続き

データベースが大量のメモリーにアクセスするのがIn-Memory Parallel Queryだから、Large Pageサポート機能を検証してみた。

ラージ・ページのサポートは、Oracle Database 10gリリース1(10.1)以上の機能です。ラージ・ページのサポートにより、Windows Server 2003で実行されているメモリー集中型のデータベース・インスタンスのパフォーマンスが向上します。新たに導入されたオペレーティング・システム・サポートを利用することにより、Oracle Database 10gリリース1(10.1)以上では、プロセッサ・メモリー・アドレッシング・リソースをより効率よく使用できるようになりました。具体的には、ラージ・ページのサポートが有効になっていると、システムのCPUはRAM内のOracle Databaseバッファにより高速にアクセスできるようになります。4KBの増分でバッファをアドレッシングするかわりに、CPUはデータベース・バッファをアドレッシングする際にPhysical Address Extension(PAE)モードでは2MBのページ・サイズ、非PAEモードでは4MBのページ・サイズを使用するように指示されます。

Oracle Databaseプラットフォーム・ガイド 11gリリース1(11.1) for Microsoft Windows E05885-03

まずは、メモリ内のPageロック機能権限を与える:

そして、レジストリにORAL_LPENABLE=1をセットする:

一応Rebootして、
THP-Hベンチマークを行うと:

nocompressでIn-Memory Parallel Queryの時の結果と比べると:

少し上がったようにも見えるが、ほとんど変わらない。

DB_BUFFER_CACHEが100GBなんていう環境であれば効果はあるのだろう。
非ページング対象になるという意味ではlock_sgaやnailed_sgaパラメータと似ている。

以前書いたExpress 5800 128GBメモリ搭載機クラス用の設定だと思う。

10046 Extended SQL Trace Interpretation 3

September 6, 2010 (Back to the Previous Post in the Series) In previous blog articles in this series we looked at various methods for starting a 10046 trace file and also the basics of walking through a trace file.  Today’s blog article continues the investigation of 10046 trace files.  To begin, we need a test [...]

Thoughts on Exadata V3

I expect we’ll see some announcements at Oracle Open World in a couple of weeks (I don’t know if they will really call it V3 yet by the way).

DISCLAIMER: This is all complete conjecture on my part. I don’t work for Oracle and I have not had any conversations with any Oracle employees about official future directions or plans. (I probably would have had to sign a non-disclosure before I could have any of those discussions and then I wouldn’t be able to do this post).

text

Anyway, here’s the list of things I am thinking we might see:

Bigger Database Servers - I mean physically bigger, with open slots, so we can put HBA’s in them to attach to external storage for migration and backup purposes. Sun 4275’s perhaps since they are already using them as storage servers. The only real issue is that we’d run out of space on a full rack configuration, but if the machines are beefier, perhaps we wouldn’t need as many DB servers anyway.

Bigger Database Servers - I mean more memory and more processing power. Faster chips and bigger DIMMs are a no-brainer. Just put them in the existing 4170 boxes. But how about different models altogether. M series perhaps (which also means a change in O/S). Should be fairly easy to do actually as the DB already runs on Solaris. Might make the Sun shops really happy as well. ;-)

More Options - I expect we’ll see a little more flexibility in the configurations, because “One Size Fits All” really doesn’t (or at least many people think that it doesn’t). Anyone want a 2/3 rack?

Incrementally Better Software - It’s a great leap forward already, but I expect more things to be offloaded to the storage layer (some of the analytic functions, some of the aggregate functions, etc…)

Exadata SAN - I could see Oracle announcing a stand alone storage unit with a variable number of “trays”. They might even announce some software for doing some of the more SAN like features (think Snap Mirror).

By the way, I doubt they’ll be buying Netapp just yet (need to drive the price of the stock down a bit first I think).

Well that’s what I’m thinking. What do you think?

Sane SAN2010: Storage Arrays – Ready, Aim, Fire

OK, this one might be contentious, but what the heck – somebody has to say it. Let’s start with a question:

Raise your hand if you have a feeling, even a slight one, that storage arrays suck?

Most DBAs and sysadmins that I speak to certainly have this feeling. They cannot understand why the performance of this very large and expensive array is nearly always lower than they achieve from the hard drive in their desktop computer. OK, so the array can do more aggregate IOPs, but why is it that 13ms, for example, is considered a reasonable average response time? Or worse, why is that some of my I/Os take several hundred milliseconds? And how is it possible that my database is reporting 500ms I/Os and the array is reporting that they are all less than 10ms? These are the questions that are lodged in the minds of my customers.

Storage Arrays do some things remarkably well. Availability, for example, is something that is pretty much nailed in Storage Array Land, both at the fabric layer and the actual array itself. There are exceptions: I think that large Fibre Channel fabrics are a High Availability disaster, and the cost of entry with director-class switches makes no sense when small fabrics can be built using commodity hardware. I have a more general opinion on Fibre Channel actually – it is an ex-format, it is pushing up the lillies. More on that in another blog post, though, I’m not done with the array yet!

The Storage Array became a real success when it became possible to access storage through Fibre Channel. Until then, the storage array was a niche product, except in the mainframe world where Escon was available. Fibre Channel, and subsequently Ethernet and Infiniband, enabled the array to be a big shared storage resource. For clustered database systems this was fantastic – an end to the pure hell of multi-initiator parallel SCSI. But then things started getting a little strange. EMC, for example, started advertising during Saturday morning children’s television about how their storage arrays allowed the direct sharing of information between applications. Well, even an eight year old knows that you can’t share raw data that way, it has to go via a computer to become anything meaningful. But this became the big selling point: all your data in one place. That also has the implication that all the types of data are the same – database files, VMware machines, backups, file shares. They are not, from an access pattern, criticality or business value standpoint, and so this model does not work. Some back-pedalling has occurred since then, notably in the form of formal tiered storage, but this is  still offered under guise of having all the data in one place – just on cheap or expensive drives.

So now we have this big, all-eggs-in-one-basket, expensive array. What have we achieved by this: everything is as slow as everything else. I visited a customer last week with just such an array, and this was the straw that broke the camel’s back (it’s been a long time coming). They have a heavily optimised database system that ultimately means the following speeds and feeds are demanded from the array:

  • 10 megabytes per second write, split into around 1300 IOPs
  • 300 reads per second

If you don’t have a good feel for I/O rates let me tell you: That is a very moderate amount of I/O. And yet the array is consistently returning I/Os that are in several hundred milliseconds, both reads and writes. Quite rightly, the customer thinks this is not very acceptable. Let’s have a little analysis of those numbers.

First the writes.: Half of those writes are sequential to the Oracle redo log and could easily be serviced by one physical drive (one 15k drive can sustain at least 100MB/s of sequential I/O). The rest of them are largely random (let’s assume 100% random), as they are dirty datafile blocks being written by the database writer. Again, a single drive could support 200 random writes per second, but let’s conservatively go for 100 – that means we need six or seven drives to support the physical write requirements of the database, plus one for the redo log. Then we need to add another three drives for the reads. That makes a very conservative total of eleven drives to keep up with the sustained workload for this customer, going straight to disk without any intermediate magic. This array also has quite a chunk of write-back cache, which means that  writes don’t actually even make it to disk before they can be acknowledged to the host/database. Why then, is this array struggling to delivery low latency I/O when it has sixty four drives inside it?

The answer is that a Storage Array is just a big computer itself. Instead of taking HTTP requests or keyboard input, it takes SCSI commands over Fibre Channel. Instead of returning a web page, it returns blocks of data. And like all complex computer systems, the array is subject to performance problems within itself. And the more complex the system, the more likely it is that performance problems will arise. To make things worse, the storage arrays have increasingly encouraged the admin to turn on more and more magic in the software to the point where it is now frequently an impossibility for the storage admin to determine how well a given storage allocation might perform. Modern storage administration has more to do with accountancy than it does performance and technology. Consider this equation:

(number of features) x (complexity of feature) = (total complexity)

Complexity breeds both performance problems and availability problems. This particular customer asked me if there was a way to guarantee that, when they replace this array, the new one will not have these problems. The answer is simple: ‘no’.

Yes, we can go right through the I/O stack, including all the components and software features of the array and fix them up. We can make sure that Fibre Channel ports are private to the host, remove all other workloads from the array so that there is no scheduling or capacity problems there. We can turn off all dynamic optimisations in the array software and we can layout the storage across known physical drives. Then, and only then, might there be a slim chance of a reduced number of high latency I/Os. I have a name for this way of operating a storage array. It’s call Direct Attached Storage (DAS): Welcome to the 1990s.

Now let me combine this reality with the other important aspect: semiconductor-based  storage. What happens when the pent up frustrations of the thousands of storage array owners meets the burgeoning reality of a new and faster storage that is now governed by some kind of accelerated form of Moore’s Law? As my business partner Jeff describes it: It’s gonna be a bloodbath.

I think that we will now see a sea change in the way we connect and use storage. It’s already started with products such as Oracle’s Exadata. I’m not saying that because I am an Oracle bigot (I won’t deny that), but because it is the right thing to do – it’s focused on doing one thing well and it uses emerging technology properly, rather than pretending nothing has changed. I don’t think it’s plug and play for many transactional customers (because of the RAC implication), but the storage component is on the money. Oh, and it is effectively DAS – a virtual machine of Windows running a virus scan won’t slow down your critical invoicing run.

I think that the way we use the storage will have to change too – storage just took a leap up the memory hierarchy. Low latency connectivity such as Infiniband will become more important, as will low latency request APIs, such as SRP. We simply cannot afford to waste time making the request when the response is no longer the major time component.

With all this change, is it now acceptable to have the vast majority of I/O latency accounted for in the complex software and hardware layers of a storage array? I don’t think so.

How not to ask a question… Again…

When asking me for help, which of the following approaches should you take? (pick 1):

  1. Insult me, then ask me to help you by answering your question.
  2. Ask a question and suggest it might be nice if I included the answer to that question in an article I had written on the subject.
  3. Just ask your question.

I’m kinda expecting most people to pick option 2 or 3, but it seems some people think option 1 is the best route to getting my assistance. Bizarre.

I am aware that English is not the first language for everyone on the internet, so maybe it’s a language thing or maybe I’m just a bit sensitive. If so, please ignore this post. Of course, if you would like me to provide support that includes dealing with your abuse I am willing to do it for a fee. What about 20% of the list price of your Oracle software on a yearly basis. I guess I would then feel obliged then no matter how much of a jerk you are.

As I’ve said many times before, if you like reading my stuff, fine. If you don’t, just go elsewhere. If you see a problem with something I’ve written I would like to hear about it so I can correct it, but if you just want to throw insults I suggest you direct them to /dev/null.

Cheers

Tim…

Not NULL

Here’s a little detail that I’ve known for years – but keep forgetting until I embarrass myself by rediscovering it (usually in front of a client). I’ll start with a demonstration of a useful little feature of mandatory columns:


drop table t1 purge;

create table t1
as
select
	*
from
	all_objects
where
	rownum <= 10000
;

execute dbms_stats.gather_table_stats(user,'t1')

create index t1_i1 on t1(object_name);

set autotrace traceonly explain

select count(*) from t1;

/*

--------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Cost (%CPU)| Time
--------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    13   (0)| 00:00:0
|   1 |  SORT AGGREGATE       |       |     1 |            |
|   2 |   INDEX FAST FULL SCAN| T1_I1 | 10000 |    13   (0)| 00:00:0
--------------------------------------------------------------------

*/

Oracle can use the index on column object_name to count the number of rows in the table because the column has been declared NOT NULL, so every row in the table also has to appear in the index. Let’s just demonstrate that by changing the column definition:


alter table t1 modify object_name null;
select count(*) from t1;

/*

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    40   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 10000 |    40   (0)| 00:00:01 |
-------------------------------------------------------------------

*/

Now let’s make the column mandatory again – by adding a constraint:


alter table t1 add constraint t1_ck_on_nn check(object_name is not null);
select count(*) from t1;

What do you think the execution path is going to be ?

Oracle uses a full table scan – even if you supply the optimizer with an index hint.

The special NOT NULL optimisation only applies if the column is declared not null – an ordinary constraint isn’t just good enough.

You may remember a short note I wrote a few days ago about adding a not null constraint to a table with 200 million rows without causing the system to stop for a few minutes. I did it with the intent of giving the optimizer a few extra options for exection plans – but now I need to go back and see whether or not the constraint has any effect at all on any of the plans we generate.

Update 6th Sept 2010

Some of the early comments have highlighted the importance of being very careful with lnaguage when describing a problem. In this case I haven’t made the distinction between an “ordinary constraint” and a “not null declaration” clear enough.  To ensure that a column does not allow nulls you can add a check constraint, or you can declare it “not null” – which, in effect, is a special type of constraint. When you declare a column to be not null Oracle will create a check constraint for that column, but also set the null$ column in sys.col$ to be non-zero.

A quick cut and paste job to show the differences (in 11.1)

SQL> alter table t1 modify n1 not null;

Table altered.

SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 N1                      NOT NULL NUMBER
 N2                               NUMBER
 V1                               VARCHAR2(180)

SQL> alter table t1 modify n1 null;

Table altered.

SQL> alter table t1 add constraint t1_ck_n1_nn check(n1 is not null);

Table altered.

SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 N1                               NUMBER
 N2                               NUMBER
 V1                               VARCHAR2(180)

SQL> alter table t1 drop constraint t1_ck_n1_nn;

Table altered.

SQL> alter table t1 modify n1 not null enable novalidate;

Table altered.

SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 N1                               NUMBER
 N2                               NUMBER
 V1                               VARCHAR2(180)

SQL

I haven’t queried user_constraints to show it but in all these cases there will be a check constraint with the search_condition: “N1″ IS NOT NULL. As you can see, though, the only time that the describe command shows the column as NOT NULL (and col$.null$ becomes non-zero) is when you use the not null declaration with the implicit attributes of enable, validate.

It’s only this declaration/setting of null$ that allows the optimizer to do its special optimisation. You might have thought that the optimizer could be programmed to read the constraint information to look for a “check column is not null” constraint to do the same trick but (at present) it simply doesn’t work that way.

Grand Canyon

Some of the boys and girls have been very busy over the summer, fulfilling a life-long ambition, visiting Las Vegas but, more important, the magnificent Grand Canyon.

<!-- s9ymdb:298 -->

But Gerald had a pretty narrow escape when he was almost left behind.

<!-- s9ymdb:299 -->