Here’s a little quiz: If I take the average row length of the rows in a table, multiply by the number of rows, and convert the result to the equivalent number of blocks, how can the total volume of data in the table be greater than the total number of blocks below the table high water mark ? I’ve got three tables in a schema, and they’re all in the same (8KB block, 1M uniform extent, locally managed) tablespace, but here’s a query, with results, showing their space utilisation – notice that I gather schema stats immediately before running my query:
execute dbms_stats.gather_schema_stats(user, method_opt=>'for all columns size 1');
select
table_name,
avg_row_len,
num_rows,
round(avg_row_len * num_rows / 8100) used_space,
blocks
from
user_tables
order by
table_name
;
TABLE_NAME AVG_ROW_LEN NUM_ROWS USED_SPACE BLOCKS
-------------------- ----------- ---------- ---------- ----------
T1 85 5000 52 63
T2 85 5000 52 18
T3 107 5000 66 32
The division by 8,100 is a fairly crude approximation to convert to from bytes to blocks – it allows for the basic block overhead, but doesn’t make any allowance for PCTFREE, ITL entries, etc. It’s generally good enough for a ball-park figure though. So one of my tables uses about 52 blocks’ worth of space, below a highwater mark of 63 blocks – but two of the tables are, apparently, using more space than has been made available !
There are probably several mechanisms that could produce these results – without using any elaborately contrived sequences of activity – all I did was “create as select” to come up with two very simple options which fell outside the boundaries of what the simple SQL diagnostic was able to handle. Any ideas ? I’ll be posting my examples later on today.
It didn’t take long for the two answers I had in mind to appear – compression and LOBs - here are the examples I had prepared:
create table t1 as select * from all_objects where rownum <= 5000 ; create table t2 compress as select * from all_objects where rownum <= 5000 ;
To calculate the row lengths, dbms_stats sums the column lengths using a simple function call (sys_op_opnsize) to get the internal column size. When the table is compressed the basic query execution code reconstructs each row before applying the function – so Oracle ends up reporting the space that would be used if the table weren’t compressed. This is why I used the same data for two different tables – so that you could compare the results from the compressed and uncompressed tables.
create table t3 as select rownum id, empty_blob() b1 from all_objects where rownum <= 5000 ;
A column that holds a LOB always holds a “lob locator” even when the LOB is stored out of row, and even when the lob is the “empty lob” (which is different from NULL). To reduce the impact on the lob index for small LOBs, a lob locator can hold a variable number of pointers to lob chunks, so the size of the basic locator may vary quite dramatically – from 20 bytes to 84 bytes. When calculating the size of the row dbms_stats always bases its calculations on the maximum possible size of a LOB locator.
In my example the in-row empty_blob() actually used 20 bytes for the locator plus a further 16 for the lob content, but Oracle used 84 + 16 in its arithmetic rather than 20 + 16, resulting in a dramatic over-estimate for the avg_row_len.
If you’re interested, this is what the empty_blob() column looked line in a block dump when in row:
In row empty_blob()
===================
col 1: [36]
00 54 00 01 02 0c 00 00 00 01 00 00 00 01 00 00 00 42 7b b0 00 10 09 00 00
00 00 00 00 00 00 00 00 00 00 00
LOB
Locator:
Length: 84(36)
Version: 1
Byte Length: 1
LobID: 00.00.00.01.00.00.00.42.7b.b0
Flags[ 0x02 0x0c 0x00 0x00 ]:
Type: CLOB
Storage: BasicFile
Enable Storage in Row
Characterset Format: IMPLICIT
Partitioned Table: No
Options: ReadWrite
Inode:
Size: 16
Flag: 0x09 [ Valid DataInRow ]
Future: 0x00 (should be '0x00')
Blocks: 0
Bytes: 0
Version: 00000.0000000000
Inline data[0]
Oracle reported the column size as 103 (84 + 16 + 1 for the lock byte, one for the flags, and one for the column length) when it should have been 39.
The Inode section (16 bytes) goes to the lob segment if you disable storage in row.
I’ve just got back from watching Man of Steel at the cinema.
I went into this film with extremely low expectations. For people of my age, this is the third time round for this story, so I expected to be pretty bored from a plot perspective.
I’m going to split the film into three parts.
On my way out I was listening to a few conversations. One woman said, “The acting was terrible and I am so bored with seeing buildings get blown up!” I heard a group of guys talking in the car park and their conversation distilled down to, “He just didn’t do anything for the last half of the film!”
As it stands, I enjoyed it a lot more than I expected, but after a fantastic start it degenerated into mediocrity.
Cheers
Tim…
Lately I have been drawn into to a fare number of discussions about IO characteristics while helping customers run benchmarks. I have been working with a mix of developers, DBAs, sysadmin, and storage admins. As I have learned, every group has there own perspective – certainly when it comes to IO and performance.
As part of pulling back the covers, I came up with a simple little tool for show IOPS at the cell level.
The cellsrv process collects data about various events and performance metrics in an Exadata storage cell. I certainly am a huge fan of the table and index usage data gathered using the ”pythian_cell_cache_extract.pl” written by Christo Kutrovsky. It is really provides a great look inside the Exadata Smart Flash Cache. So, this got me to thinking. What about IOPS data?
With the introduction of the Write Back Flash cache in X3, there is much more analysis about what is going to flash vs disk – and how what is written to flash is flushed to disk.
To look at all the current metrics gathered from the storage cells in your Exadata or SuperCluster you can run “cellcli -e list metriccurrent” on all the storage cells. The “metriccurrent” parameters are updated every minute by cellsrv to store performance data. There are a few convient parameters that can be used to sum up all the IOPS.
These parameters shore the number of IO/sec for reads and writes. By mining this data and breaking it down by “FD” vs “CD” you can see hit ratios for reads from an overall cell point of view, but now you can also see how many writes are going to FLASH vs DISK.
The “ciops-all.sh” script will look at all the cells and sum up all the IOPS and report the findings. This is very useful to get a quick look at the IO profile in the cells.
[oracle@exa6db01 WB]$ ./ciops-all.sh FLASH_READ_IOPS: 6305 DISK_READ_IOPS: 213 FLASH_WRITE_IOPS: 488203 DISK_WRITE_IOPS: 6814 TOTAL_NUMBER_OF_DRIVES: 84 WRITE_PCT_to_FLASH: 98 READ_PCT_from_FLASH: 96 IOPS_PER_DISK: 83
This can be very helpful when trying to figure out if you need to go with high performance or high capacity disks. This case shows most IO going to the flash and only 83 IOPS are spilled to each disk. So, with this case HC disks would be a fine choice. With a simple modification, I made the “ciops-mon.sh” script to print out the throughput every few minutes to graph the results over time.
This has been helpful as I have been investigating and explaining the inter-workings of the Exadata smart flash cache. Hopefully, you will find this useful when trying to analyze and understand Exadata Cell level IO with your workload.
I just realised this week that I haven’t really detailed anything about policy managed RAC databases. I remembered having done some research about server pools way back when 11.2.0.1 came out. I promised to spend some time looking at the new type of database that comes with server pools: policy managed databases but somehow didn’t get around to doing it. Since I’m lazy I’ll refer to these databases as PMDs from now on as it saves a fair bit of typing.
So how are PMDs different from Administrator Managed Databases?
First of all you can have PMDs with RAC only, i.e. in a multi-instance active/active configuration. Before 11.2 RAC you had to tie an Oracle instance to a cluster node. This is why you see instance prefixes in a RAC spfile. Here is an example from my lab 11.2.0.3.6 cluster:
DEMO1.__db_cache_size=1073741824 [...] DEMO2.__streams_pool_size=0 *.audit_file_dest='/u01/app/oracle/admin/DEMO/adump' *.audit_trail='db' *.cluster_database=true *.compatible='11.2.0.0.0' *.control_files='...' *.db_block_size=8192 *.db_create_file_dest='+DATA' *.db_domain='' *.db_name='DEMO' *.db_recovery_file_dest='+RECO' *.db_recovery_file_dest_size=4558159872 *.diagnostic_dest='/u01/app/oracle' *.dispatchers='(PROTOCOL=TCP) (SERVICE=DEMOXDB)' DEMO2.instance_number=2 DEMO1.instance_number=1 *.nls_language='ENGLISH' *.nls_territory='UNITED KINGDOM' *.open_cursors=300 *.pga_aggregate_target=310378496 *.processes=300 *.remote_listener='rac11gr2scan.example.com:1521' *.remote_login_passwordfile='exclusive' *.sessions=335 *.sga_target=1610612736 DEMO1.thread=1 DEMO2.thread=2 DEMO1.undo_tablespace='UNDOTBS2' DEMO2.undo_tablespace='UNDOTBS1'
Note that the instance_number, thread and undo tablespace are manually (=administrator) managed. If these aren’t set or configured incorrectly you will run into all sorts of fun.
Enter the Policy Managed Database
In the never ending effort to make the DBA’s life easier Oracle decided to automate the process of RAC database creation and management. So instead of manually mapping instances to nodes in policy managed databases you map databases to server pools. Please refer to my earlier post about server pool management
http://martincarstenbach.wordpress.com/2010/02/12/server-pool-experiments-in-rac-11-2/
(this is the reason I blog by the way-to be able to look up things I don’t do every day!)
Basing myself on the post I checked my 2 node lab environment. Since all I ever ran on it are administrator managed databases I was not surprised to find the basic configuration:
[oracle@rac11gr2node1 ~]$ srvctl config srvpool Server pool name: Free Importance: 0, Min: 0, Max: -1 Candidate server names: Server pool name: Generic Importance: 0, Min: 0, Max: -1 Candidate server names: rac11gr2node1,rac11gr2node2 [oracle@rac11gr2node1 ~]$
As you can see there are a free and a generic pool. SRVCTL is a great utility when you are unsure how to do something, simply add a -h to it and you will see the syntax:
[oracle@rac11gr2node1 ~]$ srvctl add srvpool -h Adds a server pool to the Oracle Clusterware. Usage: srvctl add srvpool -g[-l ] [-u ] [-i ] [-n " "] [-f] -g Server pool name -l Minimum size of the server pool (Default value is 0) -u Maximum size of the server pool (Default value is -1 for unlimited maximum size) -i Importance of the server pool (Default value is 0) -n " " Comma separated list of candidate server names -f Force the operation even though some resource(s) will be stopped -h Print usage [oracle@rac11gr2node1 ~]$
Easy enough! I am now adding a new server pool across both nodes with high importance. Note that min=max=1 here. This is done on purpose to demonstrate how to extend the database to another node.
[oracle@rac11gr2node1 ~]$ srvctl add srvpool -g blog -l 1 -u 1 -n rac11gr2node1,rac11gr2node2
That was simple! But creating the pool is only half the job done. Does it have any servers assigned to it?
[oracle@rac11gr2node1 ~]$ srvctl status srvpool Server pool name: Free Active servers count: 0 Server pool name: Generic Active servers count: 2 Server pool name: blog Active servers count: 0
There are NO servers assigned. Remember when I said I was using admin managed databases? Their instances are using the GENERIC pool by default. So for the purpose of this blog post I had to remove their configuration. In real life you would of course not do this!
[oracle@rac11gr2node1 ~]$ srvctl config database DEMO RON
Both of these are admin managed. Let’s remove them (again for the sake of demonstration only)
[oracle@rac11gr2node1 ~]$ srvctl remove database -d RON Remove the database RON? (y/[n]) y [oracle@rac11gr2node1 ~]$ srvctl remove database -d DEMO Remove the database DEMO? (y/[n]) y
Now it’s looking better, or so I hope:
[oracle@rac11gr2node1 ~]$ srvctl config srvpool -g blog Server pool name: blog Importance: 999, Min: 1, Max: 1 Candidate server names: rac11gr2node1,rac11gr2node2 [oracle@rac11gr2node1 ~]$ [oracle@rac11gr2node1 ~]$ srvctl status srvpool -g blog Server pool name: blog Active servers count: 1 [oracle@rac11gr2node1 ~]$
Hurray, I have my server!
Now over to dbca to create the policy managed database. This is not different at all from creating the admin managed database except for 1 step, shown in the figure below:
Note how I have chosen the new server pool with 1 active server. The database type is set to policy managed as well. After the usual series of “next”, “next”, “next” the database will eventually be ready for use:
Now let’s have a look at the resource definition for a policy managed database:
[oracle@rac11gr2node1 ~]$ srvctl config database -d poldb Database unique name: poldb Database name: poldb Oracle home: /u01/app/oracle/product/11.2.0/dbhome_1 Oracle user: oracle Spfile: +DATA/poldb/spfilepoldb.ora Domain: Start options: open Stop options: immediate Database role: PRIMARY Management policy: AUTOMATIC Server pools: blog Database instances: Disk Groups: DATA,RECO Mount point paths: Services: Type: RAC Database is policy managed
In this case the notion of “server pools” makes a lot of sense. Interestingly no instance is recorded, but there is one:
[oracle@rac11gr2node1 ~]$ srvctl status database -d poldb Instance poldb_1 is running on node rac11gr2node2
For those who used RAC One before the $ORACLE_SID will look familiar. As with RON, you get the DB_NAME plus instance number suffix.
Adding an instance
Now let’s increase the size of the pool by another node and see what happens. This in the only command I typed:
[oracle@rac11gr2node1 ~]$ srvctl modify srvpool -g blog -u 2
It takes a little while to complete though, but here’s the magic:
[oracle@rac11gr2node1 ~]$ srvctl status database -d poldb Instance poldb_2 is running on node rac11gr2node1 Instance poldb_1 is running on node rac11gr2node2 [oracle@rac11gr2node1 ~]$
Now is that cool or isn’t it? So without me doing anything Oracle created a new online redo thread, another undo tablespace etc. I found this in the alert.log of instance 1:
2013-06-16 21:57:16.082000 +01:00 Reconfiguration started (old inc 2, new inc 4) List of instances: 1 2 (myinst: 1) Global Resource Directory frozen Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete alter database add logfile thread 2 SIZE 52428800 , SIZE 52428800 , SIZE 52428800 minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:4 errcnt:0 2013-06-16 21:57:57.487000 +01:00 Completed: alter database add logfile thread 2 SIZE 52428800 , SIZE 52428800 , SIZE 52428800 alter database enable public thread 2 2013-06-16 21:57:58.649000 +01:00 Completed: alter database enable public thread 2 create SMALLFILE UNDO TABLESPACE UNDOTBS2 datafile SIZE 83886080 AUTOEXTEND ON NEXT 5242880 MAXSIZE 34359721984 ONLINE 2013-06-16 21:58:22.055000 +01:00 Completed: create SMALLFILE UNDO TABLESPACE UNDOTBS2 datafile SIZE 83886080 AUTOEXTEND ON NEXT 5242880 MAXSIZE 34359721984 ONLINE 2013-06-16 22:00:49.614000 +01:00
Services
Services are different in policy managed databases too. Two different models exist: singleton and uniform services. The singleton is active on a single instance only (which you don’t get to choose). A uniform service is active on all instances. Forget about preferred and available nodes, that’s for admin managed databases only! Oracle is slightly misleading here:
[oracle@rac11gr2node2 ~]$ srvctl add service -d poldb -s uni PRKO-3116 : '-g' or '-r' option should be provided [oracle@rac11gr2node2 ~]$ srvctl add service -d poldb -s uni -r poldb_1 PRKO-3114 : Policy-managed database poldb can not support administrator-managed service single. [oracle@rac11gr2node2 ~]$ srvctl add service -d poldb -s single -g blog
So again-no preferred instance, but only a server pool to specify. Once the services are created, start them as usual. The status message will be similar to this one:
[oracle@rac11gr2node2 ~]$ srvctl status service -d poldb Service single is running on nodes: rac11gr2node1 Service uni is running on nodes: rac11gr2node1,rac11gr2node2
Summary
That’s it for an introduction to policy managed databases. These are cool things, especially if you think about the possibilities to create nested server pools or automatically draw in servers from lower priority pools to those with high priority to compensate a failed node. I don’t have enough horse power on my lab server to create a 4 node cluster but I’d love to play around with these features one day.
Having said that I don’t really know of anyone who uses PMDs and server pools in production. If you happen to be one, please comment and let me know!
(* Inspired by Cary Millsap, although the story isn't as elegant as one of his ...)
It's 3 weeks, 4 weeks, 2 months since I disembarked from my second boat cruise in a week but it already seems longer. Perhaps it's because my client had thoughtfully waited until I got back into the office for the next Production release of the application I've been working on. That went well, but the subsequent infrastructure performance issues were the latest part of a 5 week long haze, and then it was time to handover to my replacement and make a start on a new role. It's been an extremely busy few months :-((
Sunday
My almost traditional pre-conference illness (maybe it's because these things are in Autumn and Spring or maybe it's pre-conference stress - who knows?) and the last frantic bits of work for the release on the horizon meant that I had very little time to work on my two new presentations so when I set off for Helsinki on Sunday, it was with laptop powered up and ready to go. Fortunately I had managed at least half of one of the presentations before I got ill. What I didn't have yet was a hotel room in Oslo for the Wednesday night because when I got around to checking on Friday, there was literally nothing. I was utterly baffled (really, how often is there *nothing* except for a hostel?) but decided I could sort that out later ....
The time difference stole another two hours from me (these things seem more signifcant when your back's against the wall) but it was a pleasant, uneventful trip and a quiet night in the hotel prepared me for setting sail on Silja Serenade the next morning.
Monday
Having never spent any time in Helsinki, I decided I might as well walk to the boat and Helsinki looks like a beautiful city that I'll be going back to at some point. Once on board and unpacked in my beautiful sea-view room, I realised that there was a stowaway on board! He has a habit of getting lost ...

Tom Kyte and Bryn Llewellyn had the good grace to give keynote presentations using slides that I'd seen several times by now which gave me most of the first day to work on ... (you get the picture). Never have I been so happy to presenting on the second day of a two-day conference
But I was very keen to see Melanie Caffrey's 'Keeping It Simple in Database Application Development'. She'd been giving this presentation at Oak Table Sunday last year at the same time as my 8-bit presentation, so I'd missed out then.
Although I know Melanie through her being a regular at the annual UKOUG conference, I'm not sure that I've ever seen her present much before, but she was as engaging and smart as I expected as she went through some of the lessons learned in her role as a Senior Development Manager at Oracle, working on linux.oracle.com. I wish I could remember more of them now, but sometimes I'm so busy listening to the messenger that I can miss some of the message! LOL
I'd warned Melanie how weird it is to be presenting just as the boat sets sail because I remember it from my first time and, sure enough, stuff started clanging, the boat started humming and swaying ever so slightly - it's an experience everyone should have at least once
She also learned just how polite and quiet Finnish people are when in a sober audience of more than a few people. Never try a rhetorical question to the audience - you're likely to just get stoney silence back! However, they truly come into their own when you add beer or have a more private conversation - somewhat like most Scots I know!
After that, Heli Helskyaho (@HeliFromFinland) gave the OUGF 25th Anniversary Speech, flipping effortlessly between Finnish and English which is about as impressive as it gets to my ears. Although I have to say, it seems that this 25th Anniversay thing is a bit debatable. Is it 25 or 26 or .... I think someone forgot to start counting!
One of the most enjoyable aspects of a boat conference is the food, and we had one of the first of many great meals, all with plenty of wine thrown in and I was lucky enough to have Melanie, Debra Lilley and Alex Nuijten for company so it was a great way to wrap the day up, particularly as I had almost finished my slides ![]()
Tuesday
Tuesday was a day of worrying about and deliverying two presentations and feeling not a little unwell. It wasn't so much seasickness as the continuation of the long-running cold that I'd shared with lots of other people, so I was a little tense when I started presenting "10053 Trace Files - Mostly Harmless", particularly as Tom Kyte and Melanie were in the audience, knowing Tom's encyclopedic knowledge of Oracle! But it seemed to go reasonably well for the first time giving this presentation, but with some areas that I can improve, particularly some more useful examples.
Then there was time to actually eat some lunch (which is an achievement for me on presentation day) before I moved on to "Fast ETL Processes using Native Oracle Features". But im the lead-up, I felt really sick and I was concerned I might perform the impressive but disgusting trick of being physically sick mid-presentation! I mentioned the possibility on Twitter, so Alex made sure he turned up with his phone at the ready so he could tweet the evidence! Fortunately I got through it ok. Again, it probably needs a little more polish but the small room was absolutely packed and it went well enough. What I needed most was to lie down and rest, which meant that I missed Alex's Analytic Functions presentation again (something I have a long personal history of doing ...)
It was worth it though, as I was in a much better state when I made it back to the conference area just in time for Heli's wrap-up session. As well as having a prize-giving raffle from the numbers on peoples conference badges during which some really inconsiderate delegates walked off with all of the Cuddly Toys ;-), Heli tossed balls around the room to elicit feedback from whichever attendee caught the balls and she asked what was their best thing about the conference. At first I thought it might be embarassing and took a while to get going (Scots and Finns have a lot in common!) but it actually worked really well because I don't think anyone would have volunteered otherwise
Maybe the real reason I liked the idea was when one of the delegates said that his favourite thing was my presentations, which was brilliant to hear. I always speak at a lot of conferences with great speaker line-ups, so I've never actually been told that in the past! Others who I won't mention are used to it on a regular basis *cough* Jonathan, Tom, Cary ... *cough*
Which put me in a great mood for the final dinner which was delicious, as all the catering was. Did I mention? But if you could have seen Bryn, Tom and Melanie *piling* the starters on to their plates, you would know I wasn't exaggerating. In fact, there were so many interesting and tasty options (particularly for those with wider tastes than me) that a lot of people just about managed a bunch of starters and then a few puddings. Between the company and the food, I didn't think it could get much better, but then I was voted Speaker of the Day. Again, that's something that never happened to me and I probably shouldn't care, but I'm afraid I did, particularly when I considered the other speakers on the agenda. Of course, I cheated a bit by doing two presentations on one day but, hey, I'll take what I can get
Heli presented me with a Moomin mug and some whisky

In the interests of actually submitting a blog post, I'll leave it there for now. Justin Bieber? He can wait!
The TIMESTAMP WITH TIME ZONE data type that got introduced a long time ago is known for some oddities, for example Tony Hasler has a nice summary of some of them here.Here is another oddity that shows up when trying to aggregate on such a data type. Have a look at the following simple example:
create table t
as
select
rownum as id
, date '2000-01-01' + rownum - 1 as some_date
, cast(date '2000-01-01' + rownum - 1 as timestamp) as some_timestamp
, cast(date '2000-01-01' + rownum - 1 as timestamp with local time zone) as some_timestamp_with_local_tz
, cast(date '2000-01-01' + rownum - 1 as timestamp with time zone) as some_timestamp_with_timezone
from
dual
connect by
level <= 1e6
;
exec dbms_stats.gather_table_stats(null, 't')
explain plan for
select count(*), some_date from t group by some_date;
explain plan for
select count(*), some_timestamp from t group by some_timestamp;
explain plan for
select count(*), some_timestamp_with_local_tz from t group by some_timestamp_with_local_tz;
explain plan for
select count(*), some_timestamp_with_timezone from t group by some_timestamp_with_timezone;
The first three all will return the same execution plan:
-----------------------------------
| Id | Operation | Name |
-----------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH GROUP BY | |
| 2 | TABLE ACCESS FULL| T |
-----------------------------------
Notice the HASH GROUP BY operation selected by default by the optimizer (which can be influenced using the [NO_]USE_HASH_AGGREGATION hint to switch between a SORT and HASH GROUP BY).But for the TIMESTAMP WITH TIME ZONE column, the following execution plan will be shown:
-----------------------------------
| Id | Operation | Name |
-----------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT GROUP BY | |
| 2 | TABLE ACCESS FULL| T |
-----------------------------------
Notice the SORT GROUP BY instead - and this cannot be influenced using the above mentioned hint. So when using TIMESTAMP WITH TIME ZONE, the hash aggregation obviously isn't supported, but for all other TIMEZONE data types it is.Depending on the scenario this might already influence the performance as the HASH based aggregation in many cases is more efficient than the sort based aggregation (bugs aside).Things however get really bad when using Parallel Execution:
explain plan for
select /*+ parallel(t 4) */ count(*), some_date from t group by some_date;
explain plan for
select /*+ parallel(t 4) */ count(*), some_timestamp from t group by some_timestamp;
explain plan for
select /*+ parallel(t 4) */ count(*), some_timestamp_with_local_tz from t group by some_timestamp_with_local_tz;
explain plan for
select /*+ parallel(t 4) */ count(*), some_timestamp_with_timezone from t group by some_timestamp_with_timezone;
Again for the first three, we get the same execution plan:
-------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | PX COORDINATOR | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | Q1,01 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | Q1,01 | PCWP | |
| 4 | PX RECEIVE | | Q1,01 | PCWP | |
| 5 | PX SEND HASH | :TQ10000 | Q1,00 | P->P | HASH |
| 6 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T | Q1,00 | PCWP | |
-------------------------------------------------------------------------
Notice how the (hash) aggregation is performed as a parallel operation (and you might even see a so called GROUP BY PUSHDOWN from 11g on represented by a second GROUP BY operation before re-distributing the data depending on the cardinalities or the usage of the [NO_]GPY_PUSHDOWN hint).Now look closely at the execution plan of the last statement using the TIMESTAMP WITH TIME ZONE data type:
-----------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT GROUP BY | | | | |
| 2 | PX COORDINATOR | | | | |
| 3 | PX SEND QC (RANDOM)| :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 4 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 5 | TABLE ACCESS FULL| T | Q1,00 | PCWP | |
-----------------------------------------------------------------------
So the (sort) aggregation is no longer performed in parallel and this means that the single Query Coordinator process has to perform the aggregation on its own, which clearly can be a threat when dealing with larger data sets that need to be aggregated.
Since internally the DISTINCT / UNIQUE operator uses the same implementation as the GROUP BY operator, exactly the same limitations apply when trying to do a DISTINCT / UNIQUE on a TIMESTAMP WITH TIME ZONE data type.I could reproduce the described behaviour on all versions tested, starting from 10.2.0.4 and including the latest 11.2.0.3.
A recent posting on OTN came up with a potentially interesting problem – it started roughly like this:
I have two queries like this:
select * from emp where dept_id=10 and emp_id=15;
select * from emp where dept_id=10 and emp_id=16;When I run them separately I get the execution plan I want, but when I run a union of the two the plans change.
This, of course, is extremely unlikely – even if we assume that the two queries are more complex than the text shown. On the other hand you might, after a little thought, come up with the idea that perhaps the optimizer had done something really clever like join factorization (moving a join that’s common to the two parts of the UNION from inside to outside the UNION), or maybe there’s some really new trick the optimizer had played because a UNION ultimately requires a SORT UNIQUE, and the optimizer had chosen a different path that returned the data from each part of the UNION in sorted order to decrease the cost of that final sort.
In fact it turned out to be a lot simpler than that. The query looked more like this:
select /*+ index(@qb_view_a t1) index(@qb_view_b t1) */ * from t2, qb_view where t2.n1 = 10 and qb_view.n2 = t2.n2 union select /*+ index(@qb_view_a t1) index(@qb_view_b t1) */ * from t2, qb_view where t2.n1 = 12 -- the real code referenced an alternative column here. and qb_view.n2 = t2.n2 ;
Of particular note is the fact that it’s a join, the join involves a view (guessing from the names in the FROM clause) and there are hints that reference query block names for query blocks that don’t exist – but perhaps are present inside the view. So what does the view look like.
create or replace view qb_view as select /*+ qb_name(qb_view_a) */ * from t1 union all select /*+ qb_name(qb_view_b) */ * from t1 ;
It’s a union all view – and the two query blocks named from the outside query are the two halves of the inner union.
Here’s an important thought – it’s quite easy to get Oracle to do what you want in a simple query (at least in the short term) by sticking in a few hints – especially if you create and reference query block names; but when you start compounding queries by combining bits of code that currently do what you want, you may find that Oracle introduces extra query blocks during transformation, and perhaps some of the query blocks you’ve referenced originally cease to exist, so the hints no longer apply.
Let’s look at the execution plan – including the ALIAS and OUTLINE sections – for the final query, and compare it with the execution plan for just one of the two pieces; starting with the single piece first:
explain plan for
select
/*+
index(@qb_view_a t1)
index(@qb_view_b t1)
*/
*
from
t2, qb_view
where
t2.n1 = 10
and qb_view.n2 = t2.n2
;
select * from table(dbms_xplan.display(null,null,'basic +outline +alias'));
--------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN | |
| 2 | TABLE ACCESS FULL | T2 |
| 3 | VIEW | QB_VIEW |
| 4 | UNION-ALL | |
| 5 | TABLE ACCESS BY INDEX ROWID| T1 |
| 6 | INDEX FULL SCAN | T1_I1 |
| 7 | TABLE ACCESS BY INDEX ROWID| T1 |
| 8 | INDEX FULL SCAN | T1_I1 |
--------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$1
2 - SEL$1 / T2@SEL$1
3 - SET$1 / QB_VIEW@SEL$1
4 - SET$1
5 - QB_VIEW_A / T1@QB_VIEW_A
6 - QB_VIEW_A / T1@QB_VIEW_A
7 - QB_VIEW_B / T1@QB_VIEW_B
8 - QB_VIEW_B / T1@QB_VIEW_B
Outline Data
-------------
/*+
BEGIN_OUTLINE_DATA
INDEX(@"QB_VIEW_A" "T1"@"QB_VIEW_A" ("T1"."N1"))
INDEX(@"QB_VIEW_B" "T1"@"QB_VIEW_B" ("T1"."N1"))
USE_HASH(@"SEL$1" "QB_VIEW"@"SEL$1")
LEADING(@"SEL$1" "T2"@"SEL$1" "QB_VIEW"@"SEL$1")
NO_ACCESS(@"SEL$1" "QB_VIEW"@"SEL$1")
FULL(@"SEL$1" "T2"@"SEL$1")
OUTLINE(@"QB_VIEW_B")
OUTLINE(@"QB_VIEW_A")
OUTLINE_LEAF(@"SEL$1")
OUTLINE_LEAF(@"SET$1")
OUTLINE_LEAF(@"QB_VIEW_B")
OUTLINE_LEAF(@"QB_VIEW_A")
ALL_ROWS
DB_VERSION('11.2.0.3')
OPTIMIZER_FEATURES_ENABLE('11.2.0.3')
IGNORE_OPTIM_EMBEDDED_HINTS
END_OUTLINE_DATA
*/
The plan shows us that we have used an index to access table t1 (with an index full scan) in both halves of the QB_VIEW’s union all; and the alias section shows us that we have a table t1 in a query block name qb_view_a, and the outline section shows that we have a hint applied in that query block that directs the optimizer to use an index on that table: INDEX(@”QB_VIEW_A” “T1″@”QB_VIEW_A” (“T1″.”N1″)); and we can see the same strategy appearing for a table t1 in query block qb_view_b. By the way, I checked the plan without the hints, and the optimizer chose to do full tablescans on t1 – so the hints were actually having an effect.
So what happens when we check the plan for the UNION of the two variants of the query:
------------------------------------------
| Id | Operation | Name |
------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT UNIQUE | |
| 2 | UNION-ALL | |
| 3 | HASH JOIN | |
| 4 | TABLE ACCESS FULL | T2 |
| 5 | VIEW | QB_VIEW |
| 6 | UNION-ALL | |
| 7 | TABLE ACCESS FULL| T1 |
| 8 | TABLE ACCESS FULL| T1 |
| 9 | HASH JOIN | |
| 10 | TABLE ACCESS FULL | T2 |
| 11 | VIEW | QB_VIEW |
| 12 | UNION-ALL | |
| 13 | TABLE ACCESS FULL| T1 |
| 14 | TABLE ACCESS FULL| T1 |
------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SET$1
3 - SEL$1
4 - SEL$1 / T2@SEL$1
5 - SET$2 / QB_VIEW@SEL$1
6 - SET$2
7 - SEL$2 / T1@SEL$2
8 - SEL$3 / T1@SEL$3
9 - SEL$4
10 - SEL$4 / T2@SEL$4
11 - SET$3 / QB_VIEW@SEL$4
12 - SET$3
13 - SEL$5 / T1@SEL$5
14 - SEL$6 / T1@SEL$6
Outline Data
-------------
/*+
BEGIN_OUTLINE_DATA
FULL(@"SEL$5" "T1"@"SEL$5")
FULL(@"SEL$6" "T1"@"SEL$6")
FULL(@"SEL$2" "T1"@"SEL$2")
FULL(@"SEL$3" "T1"@"SEL$3")
USE_HASH(@"SEL$1" "QB_VIEW"@"SEL$1")
LEADING(@"SEL$1" "T2"@"SEL$1" "QB_VIEW"@"SEL$1")
NO_ACCESS(@"SEL$1" "QB_VIEW"@"SEL$1")
FULL(@"SEL$1" "T2"@"SEL$1")
USE_HASH(@"SEL$4" "QB_VIEW"@"SEL$4")
LEADING(@"SEL$4" "T2"@"SEL$4" "QB_VIEW"@"SEL$4")
NO_ACCESS(@"SEL$4" "QB_VIEW"@"SEL$4")
FULL(@"SEL$4" "T2"@"SEL$4")
OUTLINE_LEAF(@"SET$1")
OUTLINE_LEAF(@"SEL$4")
OUTLINE_LEAF(@"SET$3")
OUTLINE_LEAF(@"SEL$6")
OUTLINE_LEAF(@"SEL$5")
OUTLINE_LEAF(@"SEL$1")
OUTLINE_LEAF(@"SET$2")
OUTLINE_LEAF(@"SEL$3")
OUTLINE_LEAF(@"SEL$2")
ALL_ROWS
DB_VERSION('11.2.0.3')
OPTIMIZER_FEATURES_ENABLE('11.2.0.3')
IGNORE_OPTIM_EMBEDDED_HINTS
END_OUTLINE_DATA
*/
The plan is completely different. We’ve lost the index full scans and we’ve reverted to the tablescans that we would have got from an unhinted query. When you look at the outline you can see why – the query blocks qb_view_a and qb_view_b have disappeared – so the hints are no longer valid. As you can see we now have four occurrences of table t1, but as the alias section shows they come from query blocks sel$2, sel$3, sel$5 and sel$6).
Is this a bug ? I don’t think so. When the optimizer produces the outline information (which can be stored as an SQL Baseline in 11g) it’s producing a set of hints that will be applied at the outermost query block, with hints that point, as necessary, to inner query blocks; this means you can’t use the same query block name twice in a query or the optimizer wouldn’t be able to identify which query block a hint was supposed to apply to. So Oracle has eliminated duplicate query block names and replaced them with the standard internally generated ones – the user’s hints no longer apply.
Footnote: Checking the clock – it took me about 15 minutes to create a simplified model based on the information available on OTN: it’s taken me 75 minutes to describe what I did and what I learned as a result. With a little practice you can get very good at creating models that help you to identify and solve problems very quickly.
Following the webinars about 11g stats that I presented on Monday John Goodhue emailed me a few questions that had come through the chat line while I was speaking, but hadn’t appeared on my screen. He’s emailed them to me, so here are the questions and answers.
1. I’d like to know what parameter to use for faster results on dbms_stats.gather_dictionary_stats
I don’t think I’ve come across any complaints about gathering dictionary stats being particularly slow, and you haven’t mentioned a specific version so I can’t make any comment about possible bugs. If you can, you first need to find out where most of the time is going, so tracing the call would be a good idea to start with. If you can’t trace it you could look at the “standard” types of problem such as: AWR history tables getting very large, optimizer stats history tables (particularly the histogram one) getting very large; synopsis tables getting very large. Histograms can be very expensive to collect, especially on partitioned tables, and many of the 11g dictionary tables are partitioned – so you might want to changed the method_opt for the collection to “for all columns size 1″. It is possible to get Oracle to dispatch multiple jobs to collect stats (the global preference CONCURRENT) if the issue is elapsed time rather than workload – check dbms_stats.set_global_prefs(), parameter CONCURRENT.
2. Do incremental statistics also work with sub-partitions?
No. Even when you have a composite partitioned table, Oracle only uses the approximate NDV method to create synopses at the partition level. Given that a synopsis is up to 16,384 rows for each column for each partition, the volume of synopsis information that could be generated by extending the mechanism down to the subpartition level could be catastrophic.
3. I have always been benefiting from block_sample=>TRUE .. I do not see you using it – is there any reason to not use it {while having in mind of course, that the accuracy may suffer a bit, with heap tables, at the cost of faster gathered stats}
The only good reason not to use it is the reason you’ve given - there are patterns of skewed data that could make a block sample even less accurate on the number of distinct values than a row sample. The particular benefit of the block sample (for “safe” distributions) is that even a small percentage row sample may require Oracle to read every block of the table as it picks pseudo-random rows – perhaps even doing “db file scattered read” requests to scan the whole table and discarding lots of blocks (one of reasons behind statistic “prefetched blocks aged out before use”) because it didn’t need rows from them. With the 11g approximate NDV, of course, the question is moot since the new mechanism examines every row in the table.
4. is this #rows * #columns sorting for count distinct as well happening for any distinct? (as SELECT DISTINCT is generally a group by every selected column)
You’re referring to my comment about “select count(distinct n1) , count(distinct n2). count(distinct n3) from table;” seems to turn each row into N rows (where N = 3 in this case) of (column id, column value) and then sort the entire set in one operation rather than N concurrent operations. I don’t think you can generalise this, you’d have to check each SQL pattern separately – initially just checking the execution plan. For example, something like ‘select distinct n1, n2, n3 from table” is finding distinct combinations, and it will do a hash unique on the number of (non-null) rows in the table irrespective of the number of columns selected.
5. what is your view on using table preferences to explicitly specified required histograms?
I made a brief statement about collecting histograms for anything but your explicitly chosen columns; following up with the suggestion that your default action should be “for all columns size 1″ followed by explicit code to generated histograms for specific tables. I would be perfectly happy to set the global prefs to “for all columns size 1″ and then set table prefs to something specific for the columns I really wanted; the only reason that I don’t put this forward as a standard practice is that it does conceal what’s going on, and it’s easy to forget that some tables have preferences set. If your organisation is really good about documentation of systems then it makes perfect sense. Having said that, though, don’t forget that I also pointed out that sometimes you may still need to create “fake” histograms to get the best possible results.
6. (paraphrased) I think Jonathan should have acknowledged that his presentation was based on the work of Amit Poddar.
I agree. This was an error of omission that I shouldn’t committed. It has been several years since Amit first described how Oracle uses a hash table for the approximate NDV and keeps halving it when the number of hash values reaches 16,384. When he closed his website he allowed me to publish the original presentation and white paper (64 pages) on my blog so I really should have remembered that the algorithm is still so little-known that he still merited attribution. As far as I am aware his work is still the most thorough and detailed description of how the algorithm works, although Oracle has extended it in 12c (as explained in the presentation I did about histograms for OOW 2012).
If you manage Oracle on Windows, you probably have wondered why it is so difficult to work out what Oracle instances are running and which ORACLE_HOMEs they use. On Unix or Linux, this is a very simple task. Oracle services and their ORACLE_HOMEs are listed in the oratab file, located in /etc/ on most platforms, and in /var/opt/oracle/ on Solaris. To find what is running, we would usually use the ‘ps’ command, and pipe it through grep to find and running PMON processes.
On Windows, it just isn’t this easy. Each Oracle instance runs in a single monolithic oracle.exe process. Nothing about the process indicates the name of the instance. When we want to find all of the configured Oracle services, we can use the ‘sc’ command, and pipe the results through find (I have added emphasis to the ASM and database instances:
C:\> sc query state= all | find "SERVICE_NAME" | find "Oracle" SERVICE_NAME: Oracle Object Service SERVICE_NAME: OracleASMService+ASM1 SERVICE_NAME: OracleClusterVolumeService SERVICE_NAME: OracleCRService SERVICE_NAME: OracleCSService SERVICE_NAME: OracleDBConsoleorcl1 SERVICE_NAME: OracleEVMService SERVICE_NAME: OracleJobSchedulerORCL1 SERVICE_NAME: OracleOraAsm11g_homeTNSListener SERVICE_NAME: OracleProcessManager SERVICE_NAME: OracleServiceORCL1 SERVICE_NAME: OracleVssWriterORCL1
For any one of these services, you can get the current state with ‘sc query’, and the path of the ORACLE_HOME it is using with ‘sc qc’.
C:\> sc query OracleServiceORCL1
SERVICE_NAME: OracleServiceORCL1
TYPE : 10 WIN32_OWN_PROCESS
STATE : 4 RUNNING
(STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
C:\> sc qc OracleServiceORCL1
SERVICE_NAME: OracleServiceORCL1
TYPE : 10 WIN32_OWN_PROCESS
START_TYPE : 3 DEMAND_START
ERROR_CONTROL : 1 NORMAL
BINARY_PATH_NAME : c:\oracle\product\11.2.0\db\bin\ORACLE.EXE ORCL1
LOAD_ORDER_GROUP :
TAG : 0
DISPLAY_NAME : OracleServiceORCL1
DEPENDENCIES :
SERVICE_START_NAME : LocalSystem
As you can see, the ORACLE_HOME, and SID are visible on the line labeled ‘BINARY_PATH_NAME’. Once you have this information, you can set your environment accordingly. It might even be worth your time to write a simple script to do this for you. Maybe you could call it ‘oraenv’!
C:\> set ORACLE_SID=ORCL1 C:\> set ORACLE_HOME=c:\oracle\product\11.2.0\db C:\> set PATH=%ORACLE_HOME%\bin;%PATH%
The latter part of the title of this blog post first!. I submitted a couple of entries for the up-coming UKOUG Oracle conference this year; I hope that they will be accepted. The Judging process is on going now. The....[Read More]
Posted by Pete On 14/06/13 At 09:57 AM
Recent comments
21 weeks 2 days ago
31 weeks 11 hours ago
32 weeks 5 days ago
35 weeks 6 days ago
38 weeks 1 day ago
47 weeks 5 days ago
49 weeks 2 days ago
50 weeks 2 days ago
50 weeks 3 days ago
1 year 1 week ago