Oakies Blog Aggregator

60TB Disk Drives?

I was reading a story where Seagate were talking about 60TB disk drives. That’s all well and good, but how quickly can I get data to and from them? If I need a certain number of spindles to get the performance I require, then I’m just going to end up with masses of wasted capacity.

I can picture the scene now. I have a database of “x” terabytes in size and I need “y” number of spindles to get the performance I require, so I end up having to buy disks amounting to “z” petabytes of space to meet my performance needs. Not only is it hard to justify, but you know the “spare” capacity will get used to store stuff that’s got nothing to do with my database.

Just think of those 60TB bad-boys in a RAID5 configuration. Shudder. :)

Feel free to insert a “SSD/Flash will solve the world’s storage problems” comment of your choice here. :)

Cheers

Tim….


60TB Disk Drives? was first posted on May 23, 2012 at 1:03 am.




VirtualBox 4.1.16 Released…

VirtualBox 4.1.16 has been released. You can download it from the usual place and see the fixes in the changelog.

I’ve downloaded the Fedora 17 build, ready for next week’s final release of Fedora 17… :)

Cheers

Tim…


VirtualBox 4.1.16 Released… was first posted on May 22, 2012 at 8:58 pm.




All about RAC and MTU with a video

Let’s first discuss how RAC traffic works before continuing. The environment for this discussion is a 2 node cluster with an 8K database block size, using the UDP protocol for cache fusion. (BTW, the UDP and RDS protocols are supported on UNIX platforms, whereas Windows uses the TCP protocol.)

UDP protocol, fragmentation, and assembly

The UDP protocol is a higher-level protocol implemented over the IP protocol (UDP/IP). Cache Fusion uses the UDP protocol to send packets over the wire (Exadata uses the RDS protocol though).

MTU defines the Maximum Transmission Unit of an IP packet. Let us consider an example with the MTU set to 1500 on a network interface. One 8K block transfer cannot be performed with just one IP packet, as the IP packet size (1500 bytes) is less than 8K. So, one UDP packet of 8K is fragmented into 6 IP packets and sent over the wire. On the receiving side, those 6 packets are reassembled to create one UDP buffer of size 8K. After the assembly, that UDP buffer is delivered to a UDP port of a UNIX process. Usually, a foreground process will listen on that port to receive the UDP buffer.

Consider what happens if the MTU is set to 9000 on the network interface: an 8K buffer can then be transmitted over the wire with just one IP packet. There is no need for fragmentation or reassembly with MTU=9000 as long as the block size does not exceed 8K. MTU=9000 is also known as a jumbo frame configuration. (But if the database block size is greater than the jumbo frame size, then fragmentation and reassembly are still required. For example, for a 32KB block size with MTU=9000, there will be three 9K IP packets and one 5K IP packet to transmit.)
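As a back-of-the-envelope check (a rough sketch only, assuming a standard 20-byte IP header and an 8-byte UDP header, and ignoring the 8-byte alignment of fragment offsets), the fragment counts quoted above can be verified with a quick calculation:

-- usable payload per IP packet is roughly MTU minus the 20-byte IP header;
-- the 8-byte UDP header travels once with the whole datagram
select ceil((8192 + 8)  / (1500 - 20)) as frags_8k_mtu_1500,   -- 6 packets
       ceil((8192 + 8)  / (9000 - 20)) as frags_8k_mtu_9000,   -- 1 packet
       ceil((32768 + 8) / (9000 - 20)) as frags_32k_mtu_9000   -- 4 packets
from dual;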

Fragmentation and reassembly are performed at the OS kernel layer, and hence it is the responsibility of the kernel and the stack below it to complete the fragmentation and reassembly. Oracle code simply calls the send and receive system calls and passes in the buffers to populate.

A few LMS system calls on the Solaris platform:

0.6178  0.0001 sendmsg(30, 0xFFFFFFFF7FFF7060, 32768)          = 8328
0.6183  0.0004 sendmsg(30, 0xFFFFFFFF7FFFABE0, 32768)          = 8328
0.6187  0.0001 sendmsg(36, 0xFFFFFFFF7FFFBA10, 32768)          = 144
...
0.7241  0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768)          = 192
0.7243  0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768)          = 192

UDP vs TCP

If you talk to a network admin about the use of UDP for cache fusion, usually a few eyebrows will be raised. From a RAC point of view, UDP is the right choice over TCP for cache fusion traffic. With TCP/IP, every packet transfer has overhead: a connection needs to be set up, the packet sent, and the process must wait for a TCP acknowledgement before considering the packet send complete. In a busy RAC system, we are talking about 2-3 milliseconds for a packet transfer, and with TCP/IP we probably would not be able to achieve that level of performance. With UDP, a packet transfer is considered complete as soon as the packet is sent, and error handling is done by Oracle code itself. As you know, a reliable network is a key to RAC stability; if almost all packets (close to 100%) are sent without any packet drops, UDP is a good choice over TCP/IP for performance reasons.

If there are reassembly failures, then that is a function of an unreliable network, the kernel, or something else, but it has nothing to do with the choice of the UDP protocol itself. Of course, RDS is better than UDP as the error handling is offloaded to the fabric, but it usually requires an InfiniBand fabric for a proper RDS setup. For that matter, VPN connections use the UDP protocol too.

IP identification?

In a busy system, there will be thousands of IP packets traveling through the interface in a given second. So, obviously, there will be many IP packets from different UDP buffers received by the interface. Also, because these ethernet frames can be delivered in any order, how does the kernel know how to assemble them properly? More critically, how does the kernel know that the 6 IP packets from one UDP buffer belong together, and the order of those IP packets?

Each of these IP packets has an IP identification and a fragment offset. Review the Wireshark files uploaded in this blog entry and you will see that all 6 IP packets have the same IP identification. That ID and the fragment offset are used by the kernel to assemble the IP packets together to create the UDP buffer.

Identification: 0x533e (21310)
..
Fragment offset: 0

Reassembly failures

What happens if an IP packet is lost, assuming MTU=1500 bytes?

From the Wireshark files with MTU 1500, you will see that each of the packets has a fragment offset. That fragment offset and the IP identification are used to reassemble the IP packets to create the 8K UDP buffer. Consider that there are 6 puzzle pieces, each puzzle piece with markings, and the kernel uses those markings (offset and IP ID) to reassemble the packets. Now consider the case where one of the 6 packets never arrives: the kernel threads will keep those 5 IP packets in memory for 64 seconds (the Linux kernel parameter ipfrag_time controls that time) before declaring a reassembly failure. Without receiving the missing IP packet, the kernel cannot reassemble the UDP buffer, and so a reassembly failure is declared.

The Oracle foreground process will wait for 30 seconds (it used to be 300 seconds or so in older versions of RAC) and, if the packet has not arrived within that timeout period, the FG process will declare a ‘gc lost packet’ and re-request the block. Of course, the kernel memory allocated for IP fragmentation and reassembly is constrained by the kernel parameters ipfrag_high_thresh and ipfrag_low_thresh, and lower values for these kernel parameters can lead to reassembly failures too (and that’s why it is important to follow all best practices from the RAC installation guides).

BTW, there are a few other reasons for ‘gc lost packets’ too. High CPU usage can also lead to ‘gc lost packet’ failures, as the process may not have enough CPU time to drain the buffers; the network buffers allocated for that process become full, and so the kernel will drop incoming packets.

It is probably better to explain these concepts visually, so I created a video. When you watch this video, notice that there is an HD button at the top of the video. Play it in HD mode so that you will have a better learning experience.

You can get the presentation file from the video here: MTU

Wireshark files explained in the video can be reviewed here:
wireshark_1500mtu
wireshark_9000mtu

BTW, when you review the video, you will see that I had a little bit of trouble identifying the packets in the Wireshark output initially. I later understood the reason for not seeing the packets filled with DEADBEEF characters. Why do you think I didn’t see the packets initially?

Also, it looks like the video quality is not that great when embedded. If you want the actual mp4 files, email me and maybe I can upload them to a Dropbox for you to download.

MTU Video

Nested Loop Join Costing

The basic formula for calculating the costs of a Nested Loop Join is pretty straightforward and has been described and published several times.

In principle it is the cost of acquiring the driving row source plus the cost of acquiring the inner row source of the Nested Loop as many times as the driving row source dictates via the cardinality of the driving row source.

Cost (outer rowsource) + Cost (inner rowsource) * Card (outer rowsource)

Obviously there are cases where Oracle has introduced refinements to the above formula, so that it no longer holds. Here is one of these cases that is probably not uncommon.

Let's start with a simple two table join that shows the above formula in action. It represents a parent-child relationship where the parent table has 10,000 rows with a unique identifier, and the child table has 1,000,000 rows in total, 100 for each parent row.

set echo on

create table t
as
select
       rownum as id
     , rownum as attr1
     , rpad('x', 100) as filler
from
       dual
connect by
       level <= 10000
;

exec dbms_stats.gather_table_stats(null, 't')

create table t2
as
select
       rownum as id
     , mod(rownum, 10000) + 1 as fk
     , mod(rownum, 20) + 1 as attr1
     , rpad('x', 100) as filler
from
       dual
connect by
       level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't2')

create index t2_idx on t2 (fk);

explain plan for
select /*+ use_nl(t t2) leading(t) index(t2) */
       *
from
       t
     , t2
where
       t.attr1 <= 500
and    t2.fk = t.id;

set pagesize 0 linesize 200 tab off

select * from table(dbms_xplan.display);

which gives the following execution plan in 11.2:

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        | 50005 |    10M| 51087   (1)| 00:10:14 |
|   1 |  NESTED LOOPS                |        |       |       |            |          |
|   2 |   NESTED LOOPS               |        | 50005 |    10M| 51087   (1)| 00:10:14 |
|*  3 |    TABLE ACCESS FULL         | T      |   500 | 54500 |    45   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | T2_IDX |   100 |       |     2   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2     |   100 | 11300 |   102   (0)| 00:00:02 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T"."ATTR1"<=500)
   4 - access("T2"."FK"="T"."ID")

There are a few noteworthy points:

1. T2.FK and T.ID have the same number of distinct keys each (10,000), so assuming corresponding primary and foreign key constraints in place this means that there is at least one matching row for every parent row.

2. There is a filter on T that restricts the driving row source to 500 rows. It is interesting to note that this filter results in a "biased join" where Oracle picks the NUM_DISTINCT from the other table for the join selectivity calculation rather than using the greatest NUM_DISTINCT of both, in this case arriving at a correct cardinality estimate of approx. 50,000.

3. The T2_IDX index has a worst-case clustering factor due to the scattering of T2.FK via the MOD function, hence the resulting cost of a single iteration of the Nested Loop join is 100 for picking up 100 rows from the table (plus 2 for the index access)

The overall cost for the loop is 45 for acquiring the driving row source plus 500 times 102, the remaining part being CPU costing overhead. This matches the above formula.
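As a quick sanity check, the figures from the plan can be plugged straight into the formula (the small difference from the 51,087 shown in the plan is the CPU costing part):

select 45 + 500 * 102 as nl_cost from dual;  -- returns 51045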

Now let's re-create table T2 using a different distribution of the T2.FK column:

drop table t2;

purge table t2;

create table t2
as
select
       rownum as id
     , mod(rownum, 500) + 1 as fk
     , mod(rownum, 20) + 1 as attr1
     , rpad('x', 100) as filler
from
       dual
connect by
       level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't2')

create index t2_idx on t2 (fk);

The only difference is now that the FK column no longer has 10,000 distinct keys but only 500.

Of course, on average 2,000 rows in T2 now match a single parent row in T, but obviously no longer does every parent row find a match in T2.

Let's review the execution plan for the same SQL statement as above:

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |   998K|   211M| 52609   (1)| 00:10:32 |
|   1 |  NESTED LOOPS                |        |       |       |            |          |
|   2 |   NESTED LOOPS               |        |   998K|   211M| 52609   (1)| 00:10:32 |
|*  3 |    TABLE ACCESS FULL         | T      |   500 | 54500 |    45   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | T2_IDX |  2000 |       |     5   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2     |  1996 |   220K|  2007   (1)| 00:00:25 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T"."ATTR1"<=500)
   4 - access("T2"."FK"="T"."ID")

Since now 2,000 rows match a single parent key on average, the cost per iteration has increased accordingly. The index range scan needs to visit some root / branch blocks plus four leaf blocks on average, resulting in a cost of 5. The table access needs to visit 2,000 rows on average, hence with the same worst-case clustering factor the cost per iteration is 2,000 plus CPU overhead.

Given the fact that we need to do this 500 times, the final cost of the Nested Loop join ought to be something close to one million, but surprisingly it is only a little bit higher than the cost of the previous scenario.

So clearly the original formula doesn't apply here, although the cost for a single iteration in the execution plan seems to match the expectations.

It looks like a long way back Oracle introduced a refinement to the original formula for the case where the join column of the inner row source has fewer distinct keys than the join column of the driving row source.

The idea behind it seems to be that this is what Oracle calls a "sparse" join. Obviously not every row from the driving row source will find a match in the inner row source, hence some of the loop iterations should end with the index lookup not finding a match, and therefore there is no need to visit the inner table afterwards. The refinement hence calculates a lower average cost of the table visit per loop iteration.

This is of course true when looking at the join by itself. But if you are unlucky and a corresponding filter on the driving row source turns the "sparse" join into a (rather) "dense" join, then this refinement can lead to a significant cost underestimate of a Nested Loop join, potentially driving the optimizer towards favouring that join method over others.

And it is exactly that scenario that the above query simulates: the filter on T results in a full match of all 500 driving rows, and the actual cost of the Nested Loop join ought to be closer to one million than 50,000 in this case.

The refinement to the cost calculation seems to be based on the ratio between the NUM_DISTINCT of the two join columns: in my example the ratio is 10,000:500, so the overall cost is only approx. 1/20 of the original cost.

There are some further details regarding how the formula deals with a small number of loop iterations. For example, the first iteration will get the full original cost, a number of further iterations (again this seems to correspond to the ratio, here for example 20) get only the index cost added, and from then on the costing of the table access levels off at the original cost downscaled by the ratio (2,000 / 20, which is 100 in this case).
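Based purely on that description (so this is a reverse-engineered reading of the plan figures, not an official formula), the cost of the second plan can be approximated like this:

-- 45              : driving full table scan
-- 5 + 2007        : first iteration at full cost
-- 19 * 5          : next (ratio - 1) iterations at index cost only
-- 480 * (5 + 100) : remaining iterations at index cost plus the table
--                   access cost downscaled by the ratio (2,000 / 20)
select 45 + (5 + 2007) + 19 * 5 + 480 * (5 + 100) as approx_cost from dual;  -- returns 52552

which is reasonably close to the 52,609 shown in the plan, the remainder being CPU costing overhead.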

The refinement obviously was added in release 10.1.0.3 and can be found in the Fix Control list as bug number 3120429. The text for this fix is "account for join key sparsity in computing NL index access cost", and it apparently only applies if the inner row source uses an index access.

This also means that the original costing can be activated by disabling the fix:

explain plan for
select /*+ opt_param('_fix_control' '3120429:0') use_nl(t t2) leading(t) index(t2) */
       *
from
       t
     , t2
where
       t.attr1 <= 500
and    t2.fk = t.id;

set pagesize 0 linesize 200 tab off

select * from table(dbms_xplan.display);

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |   998K|   211M|  1003K  (1)| 03:20:41 |
|   1 |  NESTED LOOPS                |        |       |       |            |          |
|   2 |   NESTED LOOPS               |        |   998K|   211M|  1003K  (1)| 03:20:41 |
|*  3 |    TABLE ACCESS FULL         | T      |   500 | 54500 |    45   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | T2_IDX |  2000 |       |     5   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2     |  1996 |   220K|  2007   (1)| 00:00:25 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T"."ATTR1"<=500)
   4 - access("T2"."FK"="T"."ID")

So if you happen to have such cases where the filter on a driving row source leads to an underestimate of the Nested Loop join cost as outlined here, using the fix control allows arriving at more reasonable costing figures.

I recently had such a case at a client where only a small part of a fully populated time dimension was used in the foreign key of a fact table, but the filter on the time dimension led to exactly the scenario described here - the join wasn't really sparse and the Nested Loop join cost was significantly underestimated.

TKPROF Elapsed Time Challenge – the Elapsed Time is Half of the Wait Event Time

May 20, 2012: An interesting quirk was recently brought to my attention by Mich Talebzadeh. He generated a 10046 trace at level 8 for a session, executed some SQL statements, disabled the trace, and then processed the resulting trace file with TKPROF. His TKPROF output included the following: UPDATE TESTWRITES SET PADDING1 = RPAD('y',4000,'y') WHERE [...]

Repairman Jack : Harbingers

Harbingers is the tenth book in the Repairman Jack series by F. Paul Wilson.

The last of Jack’s relatives are gone. Are his girlfriend (Gia), her daughter (Vicky) and Jack’s unborn child the next in line? Is there anything Jack can do to protect them?

This book focuses more on Jack’s relationship to “The Otherness” and “The Ally”. We see a more aggressive side of Jack, as well as the cold calm detachment when he’s doing his job. Dark, dark, dark, but also kinda exciting.

Cheers

Tim…


Repairman Jack : Harbingers was first posted on May 19, 2012 at 3:16 pm.




Ad: Upcoming Events

I've built up quite a list of upcoming events that I want to blog about, so I'll deal with them in a single post (... then the OUGN review and then *finally*, *maybe* a technical post or three ...)

Chris Date Seminar
I think I've seen this mentioned in a few blog posts already, but Peter Robson is organising a 2-day seminar presented by relational database luminary, C. J. Date. I was lucky enough to attend a single day listening to Chris in Edinburgh a few years ago now and, whilst I might not have agreed with everything he said that day (so sue me), the clarity and force of his arguments and the intellectual rigour of the discussion were a contrast to the rather more lightweight content of most conference presentations. Definitely recommended, and you can find more details here. Throw in a copy of his latest book and, if I wasn't taking so many days off work at the moment anyway, I would definitely be there.

OUG Scotland
Speaking of too many days off work ... It's been too long since I've been to the OUG Scotland Conference - at least over the past 3 years that I've been working in London. Otherwise I would be there this year and it's probably about time I was going back next year. It's a cracking event. Free registration; free beer; an absolutely stellar agenda that the Scottish user group types like Thomas Presslie always pull together; and, of course, it's in the absolutely best part of the world ;-) Given that it costs nothing to attend, I think that anyone who can't find a way of getting a day out of the office to hear some fantastic speakers needs their head examined. (Oh, a little bit of local information. Oracle's place in Linlithgow - where I used to work for a while when it was Sun's place - might look a little out of the way but the train service to Edinburgh is regular, fast and excellent.)

UKOUG Call For Papers
There's only a few weeks left to submit an abstract for one of the finest Oracle conferences - the UKOUG conference in (erm) sunny Birmingham. It was the first Oracle conference I presented at so, like Tom Kyte, I would encourage as many people as possible to give it a try (as I have with many people over the years). I've discussed this with a few people recently and they seem a little frightened to submit a presentation abstract for such a well-known conference with big-name speakers, but they seem to be forgetting something. UKOUG is a User Group. The whole point is for other users to hear your experiences and some of the best presentations I've attended have been from inexperienced speakers sharing their Real World experiences of using Oracle. Sure, I want to hear deeply technical presentations based on experimentation, but a user group conference is about so much more than that and I dread the day when I see an agenda filled with the same old speakers I've heard many times before. Consider this, too ... How did those speakers become experienced and well-known? We all have to start somewhere and UKOUG is as good as anywhere. Of course, having an abstract accepted can be tough, but unless you give it a try, how will you know?

Having been so positive about the conference over the years and encouraged people to present, I can't help being honest in saying that if UKOUG had stuck to their original idea of giving speakers a one-day pass to the conference, rather than the standard full conference pass (and when I say standard, I mean like every other conference does!) then I wouldn't have recommended it. Fortunately they've performed a U-turn on what I always considered a ridiculous decision :-( Personally, I found it a bit cheap and strangely hurtful so I'm still not sure whether I'll submit an abstract this year, but it shouldn't stop others and might lead to a wider and more varied agenda. Regardless, the agenda is always excellent.

Red Gate Seminars
There have been several times over the past few years when I've considered producing an online video of my OEM Performance Page presentations but it hasn't happened for various reasons. I recall Alex Gorbachev suggesting I repeat my (failed) Hotsos version for Pythian to post online but I never got round to it. So I was absolutely delighted when Red Gate Software asked me if I had any ideas for online webinars to supplement their existing series! I have two sessions planned in the near future. One based on the original presentation that covers Performance Page fundamentals and a second focussing more closely on the new OEM 12 features, including ASH Analytics.

You can see a list of upcoming seminars here. Even better, it includes archived copies of previous presentations by some of the best Oracle presenters around. Definitely worth a look. I'll post more details on my own webinars nearer the time.

Enkitec E4
Last, but definitely not least, my friends at Enkitec are organising an Exadata-focussed conference in Dallas - E4.

Again, although there are some pre-selected speakers, there is also an open Call For Papers so if you feel you have something interesting to say about Exadata, the opportunity is there. I'm definitely planning on going and will hopefully be giving a new Parallel Query presentation.

Force Matching Signature

It amazes and amuses me how many times I remember something that I used to know but forgot due to lack of use. I'm not sure if it's because I'm just getting older or what. :)

I had just created a SQL Profile and set the force_matching option to yes so that any SQL that had the same signature would be covered by this profile. If you need a reminder, what the force matching signature does is to treat literals in the SQL like binds (think along the lines of cursor_sharing = force). When force matching is in place, all the literals are treated as binds and the same signature is generated for any SQL that differs only by their literals (white space and case are handled already).

In my case, the force matching signature option works great when I'm working with SQL that has been generated from a reporting tool (like OBIEE) and is formulated with all literals. So, I had this gnarly OBIEE generated SQL statement that was running for over an hour that I needed to "fix". I could see what needed to happen to get a better plan, but I wasn't able to make changes to the model to get it to generate the SQL I wanted. However, I was able to produce a better plan by rewriting the SQL. Once I rewrote it, I decided to create a SQL Profile for the "bad" SQL and attach the "good" plan from my rewritten version of the query to it. I got the profile created and ran a test and everything looked good. The profile kicked in and the SQL used my "good" plan and it took just a few seconds to complete. Cool. All done, right? Well...not really.

A bit later, the same query was executed and instead of using the profile, it didn't and took over an hour again. Sigh...

What happened? Well, the SQL was the same except for a change to a literal. So, it's not really the same, but with force matching turned on, it should've been OK, right? That would've been true but, as it turned out, there was actually a single bind variable in the SQL statement in addition to all the literals. And, it was the presence of that single bind that caused the force matching signatures to be different and therefore my profile was ignored.

But why did the single bind variable mess the force matching signature up? The bind would be the same for every SQL statement, so shouldn't force matching simply convert the literals to binds and I'd be good to go? Something was tickling at the back of my brain and after a brief check of the documentation about force matching, I found this:

By setting force_match to TRUE, the SQL profile additionally targets all SQL statements that have the same text after normalizing literal values to bind variables. This setting may be useful for applications that use only literal values because it allows SQL with text differing only in its literal values to share a SQL profile. If both literal values and bind variables are in the SQL text, or if force_match is set to FALSE (default), then literal values are not normalized.

Ding, ding, ding. There it was. If both literals and binds are present, force matching won't kick in. Just to prove it to my brain, and hopefully make it stick a little better for next time, I did this simple test.

  1. Run two queries that differ only by a single literal.
  2. Verify that the force_matching_signature for both is the same.
  3. Add a predicate that uses a bind variable to each query.
  4. Execute each and check the signatures again.
SQL>variable v1 varchar2(10)
SQL>exec :v1 := 'Sunday';

PL/SQL procedure successfully completed.

SQL>
SQL>select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2011 ;

CT
--------------------
365

SQL>select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2012 ;

CT
--------------------
366

SQL>
SQL>select /* getfm */ sql_id, plan_hash_value, force_matching_signature, substr(sql_text,1,200) sql_text
2 from v$sql
3 where upper(sql_text) like '%KMFMTST00%'
4 and sql_text not like '%/* getfm */%' ;

SQL_ID PLAN_HASH_VALUE FORCE_MATCHING_SIGNATURE SQL_TEXT
------------- -------------------- ------------------------ -------------------------------------------------------------------------------
6sz5sqg1yu2u7 3996576519 9139782190997132164 select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2011
88bgq57sjbtkt 3996576519 9139782190997132164 select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2012

SQL>
SQL>select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2011 and mcal_day_name = :v1 ;

CT
--------------------
52

SQL>select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2012 and mcal_day_name = :v1 ;

CT
--------------------
53

SQL>
SQL>select /* getfm */ sql_id, plan_hash_value, force_matching_signature, substr(sql_text,1,200) sql_text
2 from v$sql
3 where upper(sql_text) like '%KMFMTST00%'
4 and sql_text not like '%/* getfm */%' ;

SQL_ID PLAN_HASH_VALUE FORCE_MATCHING_SIGNATURE SQL_TEXT
------------- -------------------- ------------------------ -------------------------------------------------------------------------------
6sz5sqg1yu2u7 3996576519 9139782190997132164 select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2011
88bgq57sjbtkt 3996576519 9139782190997132164 select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2012
48rxh2r545xqy 3996576519 5839486434578375421 select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2011 and mcal_day_name = :v1
6q610fykrr4d3 3996576519 8791659410855071518 select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2012 and mcal_day_name = :v1

As you can see, the first two statements that differed by just a single literal shared the same signature. But, for the two statements that had the bind added, the signatures were different. That's exactly what happened to me for the SQL profile I had created. When I ran my test after creating it, I didn't check the query execution for different literal values so I didn't catch this.
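By the way, if you want to check signatures without executing the statements at all, the DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE function will compute them straight from the SQL text. Here's a quick sketch using one of my test statements from above:

set serveroutput on

declare
  -- one of the literal-only test statements from above
  c_sql constant varchar2(200) :=
    'select /* kmfmtst00 */ count(*) ct from km1 where mcal_year = 2011';
  l_exact number;
  l_force number;
begin
  l_exact := dbms_sqltune.sqltext_to_signature(c_sql, false);  -- exact matching
  l_force := dbms_sqltune.sqltext_to_signature(c_sql, true);   -- force matching
  dbms_output.put_line('exact matching signature: ' || l_exact);
  dbms_output.put_line('force matching signature: ' || l_force);
end;
/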

So, in the end, I remembered something I forgot about force matching *and* I got a good mental slap about thorough testing. I assumed I knew how something was going to work and didn't use due diligence to test and confirm. Want to bet that next time I'll remember? I can't be 100% certain, but I'd say the odds are in favor that I will.  :)

Index Sizing

I was in a discussion recently about how to estimate the size of a bitmap index before you build it, and why it’s much harder to do this for bitmap indexes than it is for B-tree indexes. Here’s what I wrote in “Practical Oracle 8i”:

An interesting feature of bitmap indexes is that it is rather hard to predict how large the index segment will be. The size of a b-tree index is based very closely on the number of rows and the typical size of the entries in the index column.  The size of a bitmap index is dictated by a fairly small number of bit-strings which may have been compressed to some degree depending upon the number of consecutive 1’s and 0’s.   To pick an extreme example, imagine a table of one million rows that has one column that may contain one of eight values ‘A’ to ‘H’ say, which has been generated in one of the two following extreme patterns:

  • All the rows for a given value appear together, so scanning down the table we get 125,000 rows with ‘A’ followed by 125,000 rows of ‘B’ and so on.
  • The rows cycle through the values in turn, so scanning down the table we get ‘A’,’B’. . . ‘H’ repeated 125,000 times.  

What will the bitmap indexes look like in the two cases?  

For the first example, the basic map for the ‘A’ value will be 125,000 one-bits, followed by 875,000 zero bits – which will be trimmed off.  Splitting the 125,000 bits into bytes and adding the necessary overhead of about 12% we get an entry for the ‘A’ rows of 18K.  A similar argument applies for each of the values ‘B’ to ‘H’, so we get a total index size of around 8 x 18K – giving 156K.  

For the second example, the basic map for the ‘A’ value will be a one followed by 7 zeros, repeated 125,000 times.  There is no chance of compression here, so the ‘A’ entry will start at 125,000 bytes.  Adding the overhead this goes up to 140K, and repeating the argument for the values ‘B’ to ‘H’ we get a total index of 1.12 MB.  

This wild variation in size looks like a threat, but to put this into perspective, a standard B-tree index on this column would run to about 12 Mb irrespective of the pattern of the data.  It would probably take about ten times as long to build as well.
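Just to make the arithmetic of the second example explicit (using the same 12% overhead figure quoted above):

select 125000 * 1.12     as bytes_per_entry   -- 140,000 bytes, the "140K" above
     , 8 * 125000 * 1.12 as total_bytes       -- 1,120,000 bytes, i.e. 1.12 MB
from dual;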

I wrote up a test case to recreate this model some time ago, so here it is with the results from an instance of 11.1.0.7:

create table t1
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		level <= 10000
)
select
	chr(65 + mod(rownum-1,8))		bit_scattered,
	chr(65+trunc((rownum-1)/125000))	bit_clustered
from
	generator	v1,
	generator	v2
where
	rownum <= 1e6
;

create bitmap index t1_b1_scattered on t1(bit_scattered);
create bitmap index t1_b1_clustered on t1(bit_clustered);

select
	index_name, leaf_blocks, 8 * leaf_blocks KB
from
	user_indexes
where
	table_name = 'T1'
;

set doc off
doc

Results from 11.1.0.7
---------------------

INDEX_NAME           LEAF_BLOCKS         KB
-------------------- ----------- ----------
T1_B1_SCATTERED              164       1312
T1_B1_CLUSTERED               24        192

2 rows selected.
#

So, no big change there, then.
If you modify the code to create B-tree indexes you’ll find they are 14MB each if you don’t use compression, 12MB if you do.

SLOB Is Not An Unrealistic Platform Performance Measurement Tool – Part I. Let’s See If That Matters…To Anyone.

I just checked and found that there have been 3,000 downloads of SLOB – The Silly Little Benchmark. People seem to be putting it to good use. That’s good.

Before I get very far in this post I’d like to take us back in time–back before the smashing popularity of the Orion I/O testing tool.

When Orion first appeared on the scene there was a general reluctance to adopt it. I suspect some of the reluctance stemmed from the fact that folks had built up their reliance on other tools like bonnie, LMbench, vxbench and other such generic I/O generators. Back in the 2006 (or so) time frame I routinely pointed out that no tool other than Orion used the VOS layer Oracle I/O routines and libraries. It’s important to test as much of the real thing as possible.

Who wants to rely on an unrealistic platform performance measurement tool after all?

My “List”
Over time I built a list of reasons I could no longer accept Orion as sufficient for platform I/O testing. Please note, I just wrote “platform I/O testing” not “I/O subsystem testing.”  I think the rest of this post will make the distinction between these two quoted phrases quite clear. The following is a short version of the list:

  • Orion does not simulate Oracle processing in any way, shape or form. More on that as this blog series matures.
  • Orion is what I refer to as mindless I/O. More on that as this blog series matures.
  • Orion is useless in assessing a platform’s capability to handle modify-intensive DML (thus REDO processing, LGWR and DBWR, etc). More on that as this blog series matures.

My present-tense views on Orion sometimes surface on twitter where I am occasionally met with vigorous disagreement–most notably from my friend Alex Gorbachev. Alex is a friend, co-member of the Oaktable Network, CTO of Pythian (I love those Pythian folks), and someone who generally disagrees with most everything I say.

I respect Alex, because he has vast knowledge and valuable skills. His arguments make me think. That’s a good thing. I’m not sure, however, our respective spheres of expertise overlap.

So how do these disagreements regarding SLOB get started? Recently I tweeted:

The difference between SLOB and Orion is akin to Elliptical trainer versus skiing on the side of a mountain.

Alex replied with:

I could just as well argue that SLOB is useless because that’s not real workload anyway and you should test with your app

This quick exchange of ideas set into motion some Pythian testing by Yury. As it turns out, I think the goal of that test was to prove parity between SLOB and Orion for random reads–and perhaps not much more. If only I had published “My List” above before then.

Yury’s tests were good, albeit exceedingly small in scope. His blog post suggests more testing is on the way. That is good. If you read the comment thread on his blog entry you’ll see where I thank Yury for a good tweak to the SLOB kit that eliminates the db file parallel reads associated with the index range scans incurred by SLOB reader processes. Come to think of it though, Matt from Violin Memory pointed that one out to me some time back. Hmm, oh well, I digress. The modifications Yury detailed (init.ora parameters) will be included in the next drop of the SLOB kit. Again, thanks Yury for the testing and the init.ora parameter change recommendations!

Feel free to see Yury’s findings. They are simple: SLOB and Orion do the same thing. Really, SLOB and Orion do the same thing? Well, that may be the case so long as a) you compare SLOB to Orion only for simple random read testing and/or b) your testing is limited to a little, itsy-bitsy, teeny, tiny, teensy, minute, miniscule, meager, puny, Lilliputian-grade undersized I/O subsystem incapable of producing reasonable, modern-scale IOPS.  Yury’s experiment topped out at roughly 4,500 random read IOPS.  I’ll try to convince you that there is more to it than that (hint, modern servers are fit for IOPS in the 20,000/core range). But first, I have two quotable quotes to offer at this point:

When assessing the validity of an I/O testing tool, do so on a system that isn’t badly bottlenecked on storage.

If your application (e.g., Oracle Database)  is “mindless” use a “mindless” I/O generator–if not, don’t.

Mindless I/O
So what do I mean when I say “mindless I/O?”  The answer to that is simple. If the code performs an I/O into a memory buffer, without any application concurrency overhead, and no process even so much as peeks at a single byte of that buffer populated through DMA from the I/O adapter device driver–it’s mindless. That is exactly how Orion does what it does. That’s what every other synthetic I/O generator I know of does as well.

So what does mindless I/O look like and why does it show up on my personal radar as a problem? Let’s take a look–but first let me just say one thing–I analyze I/O characteristics on extremely I/O capable platforms. Extremely capable.

The following screen shot shows a dd(1) command performing mindless I/O by copying an Oracle OMF datafile from an XFS file system to /dev/null using direct I/O. After that another dd(1) was used to show the difference between “mindless” and meaningful I/O. The second dd(1) was meaningful because after each 1 MB read the buffer is scanned looking for lower case ASCII chars to convert to their upper-case counterpart. That is, the second dd(1) did data processing–not just a mindless tickling of the I/O subsystem.

The mindless I/O was 2.5 GB/s but the meaningful case fell to about 1/6th that at 399 MB/s. See, CPU matters. It matters in I/O testing. CPU throttles I/O–unless you are interested in mindless I/O. What does this have to do with Orion and SLOB? A moment ago I mentioned that I test very formidable I/O subsystems commensurate with modern platforms–so hold on to your hat while I tie these trains of thought together.

Building on my dd(1) example of mindless I/O, I’ll offer the following screen shot which shows Orion accessing the same OMF SLOB datafile (also via direct I/O validated with strace). Notice how I force all the threads of Orion (it’s threaded with libpthreads) to OS CPU 0 using numactl(8) on this 2s12c24t Xeon 5600 server?  What you are about to see is the single-core capacity of Orion to perform “mindless I/O”:

Unrealistic Platform Performance Measurement Tools
This is only Part I in this series.  I’ll be going through a lot of proof points to solidify backing for my Orion-related assertions in the list above, but please humor me for a moment. I’d like to know just how realistic are platform performance measurements from an I/O tool that demonstrates capacity for 144,339 physical 8K random IOPS while pinned to a single core of a Xeon 5600 processor?

We are interested in database platform IOPS capacity, right?

Through this blog series I aim to help you conclude that any tool demonstrating such an unrealistic platform performance measurement is, well, an unrealistic platform performance measurement tool.

Do you feel comfortable relying on an unrealistic platform performance measurement tool? Before I crafted SLOB I too accepted test results from unrealistic platform performance measurement tools, but I learned that I needed to include the rest of the platform (e.g., CPU, bus, etc) when I’m studying platform performance, so I left behind unrealistic platform performance measurement tools.

Until recently I didn’t spend any time discussing measurements taken from unrealistic platform performance measurement tools. However, since friends and others in social media are pitting unrealistic platform performance measurement tools against SLOB (not an unrealistic platform performance measurement tool), such comparisons are blog-worthy. Hence, I’ll trudge forward blogging about how unrealistic certain unrealistic platform performance measurement tools are. And, if you stay with me on the series, you might discover some things you don’t know because, perhaps, you’ve been relying on unrealistic platform performance measurement tools.

As this series evolves, I’ll be sharing several similar unrealistic platform performance measurement tool results as I go through the list above. That is, of course, what motivated me to leave behind unrealistic platform performance measurement tools.

Final Words For This Installment
In Yury’s post he quoted me as having said:

It’s VERY easy to get huge Orion nums

His assessment of that quote was, “kind of FALSE on this occasion.”  Having now shown what I mean by “VERY easy” (e.g., even a single core can drive massive Orion IOPS) and “huge Orion” numbers (e.g., 144K IOPS), I wonder whether Yury will be convinced about my assertions regarding unrealistic platform performance measurement tools? If not yet, perhaps he and other readers will be eventually. After all, this is only Part I. If not, Yury, I still want to say, “thanks for testing with SLOB and please keep the feedback coming.”

Alex and I may always disagree :-)

Oh, by the way folks, if all you have is Orion, use it. It is better than wild guesses–at least a little better.

Filed under: oracle