Here’s a note I’ve been meaning to research and write up for more than 18 months – every since Dion Cho pinged a note I’d written about the effects of partitioning because of a comment it made about the “2% small table threshold”.
It has long been an item of common knowledge that Oracle has a “small table threshold” that allows for special treatment of data segments that are smaller than two percent of the size of the buffer cache, viz:
If a table is shorter than the threshold then it is read to the midpoint of the cache (just like any other block read) but – whether by accident or design – the touch count (x$bh.tch) is not set and the table will fall off the LRU end of the buffer cache fairly promptly as other objects are read into the buffer. Such a tablescan would be recorded under the statistic “table scans (short tables)”.
Then, in July 2009, Dion Cho decided to check this description before repeating it, and set about testing it on Oracle 10gR2 – producing some surprising results and adding another item to my to-do list. Since then I have wanted to check his conclusions, check whether the original description had ever been true and when (or if) it had changed.
As a simple starting point, of course, it was easy to check the description of the relevant (hidden) parameter to see when it changed:
188.8.131.52 _small_table_threshold threshold level of table size for forget-bit enabled during scan 184.108.40.206 _small_table_threshold threshold level of table size for direct reads 220.127.116.11 _small_table_threshold lower threshold level of table size for direct reads
This suggests that the behaviour might have changed some time in 9i (18.104.22.168 happened to be the earliest 9i listing of x$ksppi I had on file) – so I clearly had at least three major versions to check.
The behaviour of the cache isn’t an easy thing to test, though, because there are a number of special cases to consider – in particular the results could be affected by the positioning of the “mid-point” marker (x$kcbwds.cold_hd) that separates the “cold” buffers from the “hot” buffers. By default the hot portion of the default buffer is 50% of the total cache (set by hidden parameter _db_percent_hot_default) but on instance startup or after a “flush buffer cache” there are no used buffers so the behaviour can show some anomalies.
So here’s the basic strategy:
Here’s some sample code:
create table t_15400 pctfree 99 pctused 1 as with generator as ( select --+ materialize rownum id from dual connect by rownum <= 10000 ) select rownum id, lpad(rownum,10,'0') small_vc, rpad('x',100) padding from generator v1, generator v2 where rownum <= 15400 ; create index t_15400_id on t_15400(id); begin dbms_stats.gather_table_stats( ownname => user, tabname =>'T_15400', estimate_percent => 100, method_opt => 'for all columns size 1' ); end; / select object_name, object_id, data_object_id from user_objects where object_name in ( 'T_300', 'T_770', 'T_1540', 'T_3750', 'T_7700', 'T_15400', 'T_15400_ID' ) order by object_id ; select /*+ index(t) */ max(small_vc) from t_15400 t where id > 0 ;
The extract shows the creation of just the last and largest table I created and collected statistics for – and it was the only one with an index. I chose the number of blocks (I’ve rigged one row per block) because I had set up a db_cache_size of 128MB on my 10.2.0.3 Oracle instance and this had given me 15,460 buffers.
As you can see from the query against user_objects my test case included tables with 7,700 rows (50%), 3,750 rows (25%), 1,540 rows (10%), 770 rows (5%) and 300 rows (2%). (The number in brackets are the approximate sizes of the tables – all slightly undersized – relative to the number of buffers in the default cache).
Here’s the query that I then ran against x$bh (connected as sys from another session) to see what was in the cache (the range of values needs to be adjusted to cover the range of object_id reported from user_objects):
select obj, tch, count(*) from x$bh where obj between 77710 and 77720 group by obj, tch order by count(*) ;
After executing the first index range scan of t_15400 to fill the cache three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75855 0 1 75854 0 1 75853 0 1 75851 0 1 75850 0 1 75849 0 1 75852 0 1 75855 2 9 -- Index blocks, touch count incremented 75855 1 18 -- Index blocks, touch count incremented 75854 1 11521 -- Table blocks, touch count incremented
Then after three tablescans, at 4 second intervals, of the 7,700 block table:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 -- segment header of 7700 table, touch count incremented each time 75855 0 1 75854 0 1 75852 0 1 75849 0 1 75850 0 1 75851 0 1 75855 2 9 75855 1 10 75853 0 3991 -- lots of blocks from 7700 table, no touch count increment 75854 1 7538
Then repeating the tablescan of the 3,750 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 0 1 75852 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75855 2 9 75855 1 10 75853 0 240 75852 0 3750 -- table completely cached - touch count not incremented 75854 1 7538
Then repeating the tablescan of the 1,540 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75851 3 1 -- segment header block, touch count incremented each time 75849 0 1 75850 0 1 75852 3 1 75855 2 9 75855 1 10 75853 0 149 75851 2 1540 -- Table fully cached, touch count incremented but only to 2 75852 0 2430 75854 1 7538
Then executing the tablescan of the 770 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75850 3 1 -- segment header block, touch count incremented each time 75849 0 1 75851 3 1 75852 3 1 75854 0 1 75855 2 9 75855 1 10 75851 0 69 75853 0 149 75850 2 770 -- Table fully cached, touch count incremented but only to 2 75851 2 1471 75852 0 1642 75854 1 7538
Finally executing the tablescan of the 300 block table three times:
OBJ TCH COUNT(*) ---------- ---------- ---------- 75853 3 1 75855 0 1 75854 0 1 75850 3 1 75852 3 1 75851 3 1 75855 2 9 75855 1 10 75851 0 69 75850 0 131 75853 0 149 75849 3 301 -- Table, and segment header, cached and touch count incremented 3 times 75850 2 639 75852 0 1342 75851 2 1471 75854 1 7538
This set of results on its own isn’t conclusive, of course, but the indications for 10.2.0.3 are:
I can’t state with any certainty where the used and recycled buffers might be, but since blocks from the 3750 tablescan removed the blocks from the 7700 tablescan, it’s possible that “large” tablescans do somehow go “to the bottom quarter” of the LRU.
There also some benefit in checking the statistics “table scans (short)” and “table scans (long)” as the tests run. For the 2% (300 block) table I recorded 3 short tablescans; for the tables in the 2% to 10% range (770 and 1540) I recorded one long and two short (which is consistent with the touch count increment of 2 – the first scan was expected to be long, but the 2nd and 3rd were deemed to be short based on some internal algorithm about the tables being fully cached); finally for the tables above 10% we always got 3 long tablescans.
But as it says in the original note on small partitions – there are plenty of questions still to answer:
I’ve quoted the 2% as the fraction of the db_cache_size – but we have automatic SGA management in 10g, automatic memory management in 11g, and up to eight different cache sizing parameters in every version from 9i onwards. What figure is used as the basis for the 2%, and is that 2% of the blocks or 2% of the bytes, and if you have multiple block sizes does each cache perhaps allow 2% of its own size.
And then, in 11g we have to worry about automatic direct path serial tablescans – and it would be easy to think that the “_small_table_threshold” may have been describing that feature since (at least) 22.214.171.124 if its description hadn’t changed slightly for 11.2 !
So much to do, so little time — but at least you know that there’s something that needs careful investigation if you’re planning to do lots of tablescans.
Footnote: Having written some tests, it’s easy to change versions. Running on 126.96.36.199 and 188.8.131.52, with similar sized caches, I could see that the “traditional” description of the “small_table_threshold” was true – a short tablescan was anything less 2% of the buffer cache, long tablescans were (in effect) done using just a window of “db_file_multiblock_read_count” buffers, and in both cases the touch count was never set (except for the segment header block).
A lot of people have asked me whether there’s some sort of index or “table of contents” of my TPT scripts (there’s over 500 scripts in the tpt_public.zip file – http://tech.e2sn.com/oracle-scripts-and-tools )
I have planned to create such index for years, but never got to it. I probably never will :) So a good way to extract the descriptions of some scripts is this (run the command in the directory where you extracted my scripts to):
tech.E2SN secret hacking session on Tuesday 22nd March:
Just in case you missed it – there’s still chance to sign up to my tomorrow’s ORA-4031 and shared pool hacking session. I initially planned to limit the attendees to 100 per event (as the limited GotoWebinar package is cheaper that way) but over 100 people had signed up for the US event on the day of announcement, even before it was 8am in California, so I figured I should invest a bit more and allow more people attend. So far over 500 people have signed up (total for both events). If you haven’t done so, you can sign up here:
Advanced Oracle Troubleshooting online seminar Deep Dives 1-5 on 11-15 April:
The next AOT deep dives (1-5) will start in 3 weeks, on 11-15 April. (and 6-10 will be on 9-13 May).
Check the details here:
Blogs to check out:
Andrey Nikolaev has done some serious low-level research on Oracle latches and KGX mutexes and he also presented his work this year at Hotsos Symposium (I missed his session as I was stuck in JFK instead of attending the conference on that day):
Porus Havewala is quite a Grid Control and OEM enthusiast. If you are into OEM & GC, check out his blog:
(The title’s a pun, by the way – an English form of humour that is not considered good unless it’s really bad.)
Very few people try to email me or call me with private problems – which is the way it should be, and I am grateful to my audience for realizing that this blog isn’t trying to compete with AskTom – but I do get the occasional communication and sometimes it’s an interesting oddity that’s worth a little time.
Today’s blog item is one such oddity – it was a surprise, it looked like a nasty change in behaviour, and it came complete with a description of environment, and a neatly formatted, complete, demonstration. For a discussion of the problem in Spanish you can visit the blog of John Ospino Rivas, who sent me the original email and has written his own blog post on the problem.
We start with a simple table, and then query it with a ‘select for update‘ from two different sessions:
drop table tab1 purge; create table tab1( id number, info varchar2(10), constraint tab1_pk primary key (id) using index (create index idx_tab1_pk on tab1(id)) ); insert into tab1 values(1,'a'); insert into tab1 values(2,'a'); insert into tab1 values(3,'a'); commit; execute dbms_stats.gather_table_stats(user,'tab1',cascade=>true) column id new_value m_id set autotrace on explain select id from tab1 where id = ( select min(id) from tab1 ) for update ; set autotrace off prompt ============================================================= prompt Now repeat the query in another session and watch it lock prompt And use a third session to check v$lock prompt Then delete here, commit and see what the second session does prompt ============================================================= accept X prompt 'Press return to delete and commit' set verify on delete from tab1 where id = &m_id; commit;
The fact that the primary key index is created as a non-unique index isn’t a factor that affects this demonstration.
Given the query and the data in the table, you won’t be surprised by the result of the query from the first session (for convenience I’ve captured the selected value using the ‘column new_value’ option). Here’s the result of the query and its execution plan:
ID ---------- 1 -------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 3 | 1 (0)| 00:00:01 | | 1 | FOR UPDATE | | | | | | |* 2 | INDEX RANGE SCAN | IDX_TAB1_PK | 1 | 3 | 0 (0)| 00:00:01 | | 3 | SORT AGGREGATE | | 1 | 3 | | | | 4 | INDEX FULL SCAN (MIN/MAX)| IDX_TAB1_PK | 3 | 9 | 1 (0)| 00:00:01 | -------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("ID"= (SELECT MIN("ID") FROM "TAB1" "TAB1"))
At this point the program issues instructions to repeat the query from a second session, then waits for you to press Return. When you run the same query from another session it’s going to see the data in read-consistent mode and try to select and lock the row where ID = 1, so the second session is going to hang waiting for the first session to commit or rollback.
Here’s the key question: what’s the second session going to return when you allow the first session to continue, delete the row it has selected, and commit ? Here’s the answer if you’re running 10.2.0.3 or 184.108.40.206 (which is what I happen to have easily available):
SQL> select id 2 from tab1 3 where id = ( 4 select min(id) 5 from tab1 6 ) 7 for update 8 ; ID ---------- 2 1 row selected.
Now, this seems perfectly reasonable to me – especially since I’ve read Tom Kyte’s notes on “write consistency” and seen the “rollback and restart” mechanism that kicks in when updates have to deal with data that’s changed since the start of the update. Session 2 had a (select for) update, and when it finally got to a point where it could lock the data it found that the read-consistent version of the data didn’t match the current version of the data so it restarted the statement at a new SCN. At the new SCN the current highest value was 2.
Now here’s what happened when I ran the test under 220.127.116.11:
SQL> select id 2 from tab1 3 where id = ( 4 select min(id) 5 from tab1 6 ) 7 for update 8 ; no rows selected
The upgrade produces a different answer !
At first sight (or guess) it looks as if the query has run in two parts – the first part producing the min(id) of 1 using a read-consistent query block, with the second part then using the resulting “known value” to execute the outer select (shades of “precompute_subquery”) and restarting only the second part when it discovers that the row it has been waiting for has gone away.
It doesn’t really matter whether you think the old behaviour or the new behaviour is correct – the problem is that the behaviour has changed in a way that could silently produce unexpected results. Be careful if any of your code uses select for update with subqueries.
As a defensive measure you might want to change the code to use the serializable isolation level – that way the upgraded code will crash with Oracle error ORA-08177 instead of silently giving different answers:
SQL> alter session set isolation_level = serializable; Session altered. SQL> get afiedt.buf 1 select /*+ gather_plan_statistics */ 2 id 3 from tab1 4 where id = ( 5 select min(id) 6 from tab1 7 ) 8* for update 9 / from tab1 * ERROR at line 3: ORA-08177: can't serialize access for this transaction
It might be a way of avoiding this specific problem, of course, but it’s not a frequently used feature (the first pages of hits on Google are mostly about SQL Server) so who knows what other anomalies this change in isolation level might introduce.
A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:
*** 2011-03-17 14:34:37.242 WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725 WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368 WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790 WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865 [...snipped...] WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205 WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221 WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560 WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579 WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685 WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706 WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720 #ff0000;">LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735
In past versions of Oracle the CPU (c=) usage figures and other stats like number of physical/logical reads of the LOB chunk read OPI call were just lost – they were never reported in the tracefile. In past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor) were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREAD’s are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what are those WAITs for cursor #0 – which is really a pseudocursor.
Why cursor#0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to shared cursor’s executable parts in library cache from there). But LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent from any cursors, it doesn’t follow the same cursor API as regular SQL statements (as it requires way different functionality compared to a regular select or update statement).
So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.
Note that on versions earlier than 11.2(.0.2?) when the LOBREAD wasn’t printed out to trace – you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to OCI API in the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):
SQL> @oerr 10051 ORA-10051: trace OPI calls SQL> alter session set events '10051 trace name context forever, level 1'; Session altered.
Now run some LOB access code and check the tracefile:
*** 2011-03-17 14:37:07.178 WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602 OPI CALL: type=105 argc= 2 cursor= 0 name=Cursor close all CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731 OPI CALL: type=94 argc=28 cursor= 0 name=V8 Bundled Exec ===================== PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82' select * from t_lob END OF STMT PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805 EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884 WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939 WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785 OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945 FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963 WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629 #ff0000;">OPI CALL: type=96 argc=21 cursor= 0 name=#ff0000;">LOB/FILE operations WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788 [...snip...] WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054 LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064
Check the bold and especially the red string above. Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it (unlike SQL trace call lines where all the activity happens before a call line is printed (with some exceptions of course)) are done for this OPI call (until a next OPI call is printed out). OPI call tracing should work even on ancient database versions…
By the way, if you are wondering, what’s the cursor number 47112801352808 in the “WAIT #47112801352808″ above? Shouldn’t the cursor numbers be small numbers?
Well, in 18.104.22.168 this was also changed. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.
Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.
It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:
If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:
Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:
This is part 2 of an article on the KEEP cache. If you haven’t got here from part one you should read that first for an explanation of the STATE and CUR columns of the output.
Here’s a little code to demonstrate some of the problems with setting a KEEP cache – I’ve set up a 16MB cache, which gives me 1,996 buffers of 8KB in 10.2.0.3, and then created a table that doesn’t quite fill that cache. The table is 1,900 data blocks plus one block for the segment header (I’ve used freelist management to make the test as predictable as possible, and fixed the pctfree to get one row per block).
create table t1 pctfree 90 pctused 10 storage (buffer_pool keep) as with generator as ( select --+ materialize rownum id from dual connect by rownum <= 10000 ) select rownum id, lpad(rownum,10,'0') small_vc, rpad('x',1000) padding from generator v1, generator v2 where rownum <= 1900 ; alter system flush buffer_cache; -- scan the table to load it into memory select /*+ full(t1) */ count(small_vc) from t1 where id > 0 ; -- check the content of x$bh from another connection -- update every fifth row (380 in total) update t1 set small_vc = upper(small_vc) where mod(id,5) = 0 ; -- check the content of x$bh from another connection
The query I ran under the SYS account was this:
select bpd.bp_blksz, bpd.bp_name, wds.set_id, bh.state, cur, ct from ( select set_ds, state, bitand(flag, power(2,13)) cur, count(*) ct from x$bh group by set_ds, state, bitand(flag, power(2,13)) ) bh, x$kcbwds wds, x$kcbwbpd bpd where wds.addr = bh.set_ds and bpd.bp_lo_sid <= wds.set_id and bpd.bp_hi_sid >= wds.set_id and bpd.bp_size != 0 order by bpd.bp_blksz, bpd.bp_name, wds.set_id, bh.state, bh.cur ;
In my test case this produced two sets of figures, one for the DEFAULT cache, and one for the KEEP cache but I’ve only copied out the results from the KEEP cache, first after the initial tablescan, then after the update that affected 380 blocks:
BP_BLKSZ BP_NAME SET_ID STATE CUR CT ---------- ------- ---------- ----- ----- ---------- 8192 KEEP 1 0 0 95 1 0 1901 ******* ********** ---------- sum 1996 BP_BLKSZ BP_NAME SET_ID STATE CUR CT ---------- ------- ---------- ----- ----- ---------- 8192 KEEP 1 1 0 1462 1 8192 380 3 0 323 ******* ********** ---------- sum 1996
In the first output you see the 1901 buffers holding blocks from the table (1,900 data plus one segment header), with the remaining 95 buffers still “free” (state 0). The table blocks are all shown as XCUR (state 1, exclusive current)
In the second output you see 380 buffers holding blocks with state ‘XCUR’ with bit 13 of the flag column set, i.e. “gotten in current mode”. These are the 380 blocks that have been updated – but there are also 323 blocks shown as CR (state 3, “only valid for consistent read”). A detailed check of the file# and dbablk for these buffers shows that they are clones of (most of) the 380 blocks in the XCUR buffers.
Do a bit of arithmetic – we have 1462 blocks left from the original tablescan, plus 380 blocks in CUR mode (of which there are 323 clones) for a total of 1,842 blocks – which means that 59 blocks from the table are no longer in the cache. As we clone blocks we can lose some of the blocks we want to KEEP.
Unfortunately for us, Oracle has not given any preferential treatment to buffers which hold blocks in the XCUR state – any buffer which reaches the end of LRU chain and hasn’t been accessed since it was first loaded will be dumped so that the buffer can be used to create a clone (but see footnote). This means that a constant stream of inserts, updates, deletes, and queries could result in lots of clones being created in your KEEP cache, pushing out the data you want to keep.
If you want to size your KEEP cache to minimise this effect, you probably need to start by making it somewhat larger than the objects it is supposed to KEEP, and then checking to see how many clones you have in the cache – because that will give you an idea of how many extra buffers you need to stop the clones from pushing out the important data.
When I wrote and ran the test cases in this note the client was running Oracle 10.2 – while writing up my notes I happened to run the test on 22.214.171.124 (still using freelists rather than ASSM) and got the following output from my scan of the KEEP cache:
BP_BLKSZ BP_NAME SET_ID STATE CUR CT ---------- ------- ---------- ----- ----- ---------- 8192 KEEP 1 1 0 1901 3 0 91 ******* ********** ---------- sum 1992
Apart from the fact that you get slightly fewer buffers per granule in 11g (the x$bh structure has become slightly larger – and x$bh is a segmented array where each segment shares the granule with the buffers it points to) you can see that we only have 91 clones in the KEEP cache, and apparently we’ve managed to update our 380 blocks without changing their flag to “gotten in current mode”. Doing an update is, of course, just one way of making clones appear – but perhaps 11g will generally have more success in keeping current versions of blocks in memory than earlier versions.
There is , unfortunately, a very special feature to this test case – it’s using a single tablescan to update the table. So having said in part 1 that I was going to write a two-part article, I’ve got to this point, done a few more tests, and decided I need to write part three as well. Stay tuned.
Footnote: Technically there are a couple of circumstances where Oracle will bypass the buffer and walk along the LRU chain looking for another block – but I’ve excluded them from this demonstration.
Free stuff! Free stuff! Free stuff! :-)
The awesome dudes at E2SN have done it again! (and yes, Tom, this time the “we at E2SN Ltd” doesn’t mean only me alone ;-)
On Tuesday 22nd March I’ll hold two (yes two) Secret Oracle Hacking Sessions – about ORA-04031: unable to allocate x bytes of shared memory errors, cursor management issues and other shared pool related problems (like fragmentation). This event is free for all! You’ll just need to be fast enough to register, both events have 100 attendee limit (due to my GotoWebinar accont limitations).
I am going to run this online event twice, so total 200 people can attend (don’t register for both events, please). One event is in the morning (my time) to cater for APAC/EMEA region and the other session is for EMEA/US/Americas audience.
The content will be the same in both sessions. There will be no slides (you cant fix your shared pool problems with slides!) but there will be demos, scripts, live examples and fun (for the geeks among us anyway – others go and read some slides instead ;-)!
As my frequent readers know, I have promised to not travel anymore as it’s just too much hassle compared to the benefit of being “there”. This is why I’m going to fly to London on Monday, 18th April to speak at the UKOUG Exadata Special Event. This event is just too sexy to be missed, so I made an exception (the last one, I promise!)… and it’s probably going to be warmer there as well compared to where I am now :-)
I will be talking about what’s been my focus area for last year or so – Oracle Exadata Performance.
Dan Norris and Alex Gorbachev will be speaking there too, so it should end up being a pretty awesome event!
More details here:
My abstract is following:
In order to systematically troubleshoot and optimize Exadata performance, one must understand the meaning of its performance metrics.
This session provides a deep technical walkthrough of how Exadata IO and smart scans work and how to use relevant metrics for troubleshooting related performance issues. We will review both Exadata database and cell-level metrics, cell wait events and tools useful for troubleshooting. We will also look into metrics related to Exadata Hybrid Columnar Compression and the cell Flash Cache usage.
P.S. The reason why I called this post “Exadata Training” is that you’ll learn some real world practical stuff there… as opposed to the marketing material (and marketing material copy material) overdose out there… ;-)
Here’s a bit of geek stuff that I’ve been meaning to write up for nearly a year – to the day, more or less – and I’ve finally been prompted to finish the job off by the re-appearance on the OTN database forum of the standard “keep cache” question:
This is a two-part note – and in the first part I’m just going to run a query and talk about the results. The query is one that has to be run by SYS because it references a couple of x$ structures, and this particular version of the query was engineered specifically for a particular client.
select obj, state, bitand(bh.flag,power(2,13)) cur, count(*) ct from x$bh bh, x$kcbwds wds where wds.addr = bh.set_ds and wds.set_id between 1 and 24 group by obj, state, bitand(bh.flag,power(2,13)) order by obj, state, bitand(bh.flag,power(2,13)) ;
You’ll notice I’m joining x$bh (the “buffer header” array) to x$kcbwds (the “working data set” array) where I’ve picked sets 1 to 24. On this particular system these were the sets for the KEEP cache. (If you want a generic query to isolate a particular cache then there’s an example here that identifies the RECYCLE cache by reference – but I wanted the query in this note to run as efficiently as possible against this production system, so I did a preliminary lookup against x$kcbwbpd and then used the literal set ids).
Here are a few lines from the resulting output:
OBJ STATE CUR CT ---------- ---------- ---------- ---------- 40158 1 8192 1 40189 1 0 87233 8192 272789 3 0 69804 8192 393868 40192 1 0 87 8192 12197 3 0 30763 8192 1994 ... 117291 1 0 498 8192 4419 3 0 3001 8192 15 117294 1 0 243 8192 3544 3 0 1245 8192 23 4294967294 3 0 2 ********** ---------- sum 1216072
Since we’re looking at x$ structures – which rarely have any official documentation – the rest of this note isn’t guaranteed to be correct – and things do change with version so I need to stress that this specific example comes from 126.96.36.199. This is what I think the results show:
The state column is instance-related and is essentially something that’s useful in a RAC enviroment. State 1 translates to ‘XCUR’ (exclusive current) which means that this instance has exclusive rights to the most recent version of the block; state 3 translates to ‘CR’ (only valid for consistent read).
Bit 13 of the flag column is set if the buffer has been “gotten in current mode”. (If you’re interested in the other bits there’s a page on my old website that might keep you entertained for a while – I haven’t yet updated it to 11g, though.)
The problem for the client was this – the total size of all the data segments in the KEEP cache was about 6GB and the total size of the KEEP cache was about 10GB, yet the database was still reporting a constant trickle of physical reads to the objects and, when the code to “re-load” the cache was executed at a quiet period at the start of the day some 60,000 physical blocks had to be read. With a 10GB cache for 6GB of data would you really expect to see this I/O ?
Take a look at the figures for object 40189:
There are 272,789 buffers for blocks that were “gotten in current mode” (bit 13 is set) and are also “exclusive current” (state 1) to the instance, but there are also 393,868 buffers that were originally “gotten in current mode” but are now “only valid for consistent read”.
Similarly there are 87,233 buffers for blocks that weren’t “gotten in current mode” but are “exclusive current” to the instance – in other words they are the most up to date version of the block but weren’t fetched with a “db block get”, and again there are 69,804 buffers holding blocks that were not “gotten in current mode” but which are now “only valid for consistent read”.
Buffers that are “only valid for consistent read” are buffers holding blocks that have been generated through one of Oracle’s mechanisms for creating CR (consistent read) clones. As you can see, then, a block that is a CR clone may still be flagged as “gotten in current mode”. In fact, in line with Oracle’s generally “lazy” approach to work you can even find (in some versions of Oracle, at least) CR clones that still have the “dirty” bit set in the flag, even though CR clones can never really be dirty and are NEVER written to disc.
Take another look at the buffer counts – this KEEP cache is sized at 1.2M buffers (10GB), but object 40189 alone has taken out 460,000 of those buffers (3.6GB) in block clones, and for this object there are more clones than originals (at 360,000, which happens to be just a few thousand blocks less than the size of the table). So, when you’re thinking about creating a KEEP cache, remember that you have to allow for block cloning – simply setting the db_keep_cache_size to something “a bit bigger” than the object you want to keep cached may not even be close to adequate.
Part 2 to follow soon.
For all that loves Oracle performance.. Hotsos is truly the best conference, all the speakers are performance geeks, all the attendees talks about performance. Everything is about performance! There are lot more stuff that I like about my first Hotsos experience, the following are some of them..
I like that presenters and attendees are curious about what each performance geek has to say..
I like that whenever their brains are already fried. They consume as much coffee/soda as they can.. and just lay their butts on this couch.. and still.. talk about performance.