Who's online

There are currently 0 users and 38 guests online.

Recent comments


Oakies Blog Aggregator

So you don’t see any disks when trying to install ASM?

This is a post that highlights the difference between operating systems, but also the fact that sometime it is hard to break out of a habit once you got used to it. My background is that of a Linux enthusiast, even though I equally like Solaris and AIX but I have a little less exposure to those.


I have recently been asked to look at RAC on SPARC, which I gladly did. The system I was given had the usual software stack for RAC at this customer’s site. It comprised of:

  • Solaris 10 Update 9 64bit on SPARC
  • EMC Power Path 5.1
  • EMC VMAX storage – 10x10G LUNs for a specific performance test

The Power Path configuration has already been in place when I got the machine, and I was allocated /dev/emcpower2[0-9] for my ASM testing. For the experienced Linux user who relies on device-mapper-multipath, the Power Path naming convention can be a bit confusing at first. For reference, the pseudo devices we are interested in for ASM are created under /dev/rdsk/-the “raw” device directory for “character” based access rather than the block device in /dev/dsk/.  By default, the Power Path devices are called “emcpower”, followed by a number and a letter (in SPARC). An example would be /dev/rdsk/emcpower20c.

The number (20) is just a sequence number and doesn’t have any other meaning as far as I know. It also doesn’t seem to be consistent across the cluster nodes by default. The suffix-letter (“c”) does have a meaning though: it indicates the slice of the device. For example, the latter “a” indicates slice 0, xxxc is the whole disk, and xxxg is slice 6.

You can use the format command to have a look at the slices currently defined. Simply type “partition” and then “print” to print it.

Now all the disks I was given had a slice 6, from cylinder 274 to the end of the disk:

# format
[disk selection not shown here]
format> partition
partition> print
Current partition table (original):
Total disk cylinders available: 10238 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 -   136      128.44MB    (137/0/0)     263040
  1       swap    wu     137 -   273      128.44MB    (137/0/0)     263040
  2     backup    wu       0 - 10237        9.37GB    (10238/0/0) 19656960
  3 unassigned    wm       0                0         (0/0/0)            0
  4 unassigned    wm       0                0         (0/0/0)            0
  5 unassigned    wm       0                0         (0/0/0)            0
  6        usr    wm     274 - 10237        9.12GB    (9964/0/0)  19130880
  7 unassigned    wm       0                0         (0/0/0)            0

partition> quit

If you’d like to see how the disks are connected to the array, of if there are any faults, use powermt display dev=all, or alternatively look at a specific device:

# powermt display dev=emcpower20
Pseudo name=emcpower20a
Symmetrix ID=00x2x4x0x6x4
Logical device ID=0x0x
state=alive; policy=SymmOpt; priority=0; queued-IOs=0;
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
3072 pci@400/pci@0/pci@c/fibre-channel@0/fp@0,0 c2t50000974C00A6118d17s0 FA  7eA   active  alive       0      0
3077 pci@500/pci@0/pci@9/fibre-channel@0/fp@0,0 c3t50000974C00A611Cd17s0 FA  8eA   active  alive       0      0

If you care to learn more about how your system administrator configured the paths (I have 2 HBAs as you can see), use “powermt display”. Also note the disks c2t5….s0 and c3t5…s0 are the native devices aggregated into emcpower20a.

I have changed the permissions on the pseudo devices to 664 and made them owned by oracle:oinstall before launching OUI. In the disk discovery screen I changed the disk discovery string to /dev/rdsk/emcpower2* but didn’t see a single disk in the OUI window. That was odd.

Troubleshooting disk discovery

This error lead me to look at /tmp/OraInstall/installActions.log which referenced a call to kfod, the disk discovery tool. The entry in the log file showed this content:

WARNING: SRVCTL not found as there is no Grid Infrastructure home on the box.
 INFO: Executing [/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod, nohdr=true, verbose=true, disks=all, status=true\
 op=disks, asm_diskstring='/dev/rdsk/emcpower2*']
 with environment variables {XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt, DISPLAY=localhost:12.0,\
 PWD=/u01/app/oracle/stage/grid, LC_CTYPE=en_GB.ISO8859-1, _=./runInstaller, USER=oracle,\
 NLSPATH=/usr/dt/lib/nls/msg/%L/, LC_COLLATE=en_GB.ISO8859-1, CLASSPATH=, HOME=/export/home/oracle,\
 LC_NUMERIC=en_GB.ISO8859-1, SSH_CONNECTION= 52883 22,\
 ORACLE_HOME=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin, SHELL=/usr/bin/bash, LOGNAME=oracle,\
 SSH_TTY=/dev/pts/3, SHLVL=1, LC_MONETARY=en_GB.ISO8859-1, MANPATH=/usr/share/man:/usr/openwin/share/man:\
 /opt/SUNWexplo/man:/opt/SUNWcstu/man:/opt/VRTSvlic/man, SSH_CLIENT= 52883 22, \
 MAIL=/var/mail//oracle, TZ=GB-Eire, LC_TIME=en_GB.ISO8859-1, JAVA_HOME=, LC_MESSAGES=C, PS1=[\u@\h \W]> , \
 OLDPWD=/u01/app/oracle/stage, LC_ALL=, TERM=xterm, TMOUT=0, PATH=/usr/bin:/usr/ccs/bin:/usr/bin:/opt/EMCpower/bin:\
 INFO: Starting Output Reader Threads for process /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod
 INFO: Parsing KFOD-00311: Error scanning device /dev/rdsk/c3t50000974C00A611Cd16s7
[more errors skipped]

(Re-) Discovery of KFOD

So therein lies the problem! As a side effect I rediscovered the command line options used with kfod! Maybe I can use this for some experiments… All you need to do is to set environment variables for ORACLE_HOME and LD_LIBRARY_PATH as per the above output.

After a little fiddling around I remembered that unlike the tests with ZFS I did before I can’t pass the whole disk to ASM (the VTOC is in the first part of the disk!), but rather have to use a slice. So by changing the disk string to /dev/rdsk/emcpower2*g I got a successful report from kfod, executed as “oracle”:

$ /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod verbose=true, disks=all status=true op=disks \
> asm_diskstring='/dev/rdsk/emcpower2*g'
 Disk          Size Header    Path                                     User     Group
   1:       9341 Mb CANDIDATE /dev/rdsk/emcpower20g                    oracle   oinstall
   2:       9341 Mb CANDIDATE /dev/rdsk/emcpower21g                    oracle   oinstall
   3:       9341 Mb CANDIDATE /dev/rdsk/emcpower22g                    oracle   oinstall
   4:       9341 Mb CANDIDATE /dev/rdsk/emcpower23g                    oracle   oinstall
   5:       9341 Mb CANDIDATE /dev/rdsk/emcpower24g                    oracle   oinstall
   6:       9341 Mb CANDIDATE /dev/rdsk/emcpower25g                    oracle   oinstall
   7:       9341 Mb CANDIDATE /dev/rdsk/emcpower26g                    oracle   oinstall
   8:       9341 Mb CANDIDATE /dev/rdsk/emcpower27g                    oracle   oinstall
   9:       9341 Mb CANDIDATE /dev/rdsk/emcpower28g                    oracle   oinstall
  10:       9341 Mb CANDIDATE /dev/rdsk/emcpower29g                    oracle   oinstall
KFOD-00311: Error scanning device /dev/rdsk/emcpower2g
ORA-15025: could not open disk "/dev/rdsk/emcpower2g"
SVR4 Error: 13: Permission denied
Additional information: 42
Additional information: 272256
Additional information: 12699536

I haven’t worked out which wildcard characters are valid in the ASM_DISKSTING (‘/dev/rdsk/emcpower2[0-9]g’ didn’t work), so I left it there. The emcpower2g device was not meant to be used in the test anyway.

With this setting I returned to the OUI window and continued the installation.

Lesson Learned

Oracle doesn’t always have useful logs, but in this case it was very useful to look at the install-actions file in /tmp. And also remember that creating ASM disks is different from Linux in Solaris and other platforms.

Logical I/O - Evolution: Part 2 - 9i, 10g Prefetching

In the initial part of this series I've explained some details regarding logical I/O using a Nested Loop Join as example.

To recap I've shown in particular:

- Oracle can re-visit pinned buffers without performing logical I/O

- There are different variants of consistent gets - a "normal" one involving buffer pin/unpin cycles requiring two latch acquisitions and a short-cut variant that visits the buffer while holding the corresponding "cache buffers chains" child latch ("examination") and therefore only requiring a single latch acquisition

- Although two statements use a similar execution plan and produce the same number of logical I/Os one is significantly faster and scales better than the other one

The initial part used the "classic" shape of the Nested Loop Join, but Oracle introduced in recent releases various enhancements in that area - in particular in 9i the "Table Prefetching" and in 11g the Nested Loop Join Batching using "Vector/Batched I/O".

Although these enhancements have been introduced primarily to optimize the physical I/O patterns, they could also have an influence on logical I/O.

The intention of Prefetching and Batching seems to be the same - they both are targeted towards the usually most expensive part of the Nested Loop Join: The random table lookup as part of the inner row source. By trying to "prefetch" or "batch" physical I/O operations caused by this random block access Oracle attempts to minimize the I/O waits.

I might cover the effect on physical I/O of both "Prefetching" and "Batching" in separate posts, here I'll only mention that you might see "db file scattered read" or "db file parallel read" multi-block I/O operations instead of single block "db file sequential read" operations for the random table access with those optimizations (Index prefetching is also possible, by the way). Note also that if you see the Prefetching or Batching plan shape it does not necessarily mean that it is actually going to happen at execution time - Oracle monitors the effectiveness of the Prefetching and can dynamically decide whether it will be used or not. Table Prefetching - Random order

Let's enable table prefetching in 10.2 and re-run the original test case. The first run will use the different order variant of T1 and T2:

Inner row source Unique Index - T1 different order than T2

| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:04.12 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.90 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.71 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.20 | 200K|

Inner row source Non-Unique Index - T1 different order than T2

| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:05.03 | 311K|
| 2 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 1 | 2 (0)| 100K|00:00:04.40 | 311K|
| 3 | NESTED LOOPS | | 1 | 100K| 202K (1)| 200K|00:00:03.02 | 211K|
| 4 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.49 | 201K|

As you'll see, in 10g even with table prefetching enabled the unique index variant does look the same and performs similar as in the original post.

This changes in 11g by the way, where the unique index variant also supports the table prefetching plan shape.

For the non-unique variant you'll see a different shape of the execution plan where the inner row source random table lookup is actually a parent operation to the Nested Loop Join (and hence will only be started once and consumes the information generated by the child Nested Loop operation).

Note that in case of nested Nested Loop Joins only the inner-most row source will make use of the Table Prefetching shape. The same applies to the 11g Nested Loop Join Batching. If you happen to have several Nested Loops Joins that are not directly nested then each of the inner-most row sources might use the Table Prefetching/Batching shape - which means that it can be used more than once as part of a single execution plan.

If you compare the Runtime profile of the non-unique index variant with the original Runtime profile without Table Prefetching you'll not see any difference in terms of logical I/O, however it becomes obvious that the overall execution is actually slightly faster (more significant with row source sampling overhead enabled). In particular the random table access requires significantly less time than in the original Runtime profile, so it seems to be more efficient, although it is still slower than the unique index variant.

Begin Update

Having focused on the logical I/O I completely forgot to mention the inconsistency in the A-Rows column (thanks to Flado who pointed this out in his comment below), which shows 200K rows for the Nested Loop operation although only 100K rows have been identified in the inner index lookup. I believe this is an inconsistency that also shows up when performing an SQL trace so it seems to be a problem with the row source statistics. In principle with this plan shape the Nested Loop Join operation seems to account for the sum of both the rows identified in the driving row source and the inner index lookup, rather than the expected number of rows identified in the inner index lookup only.

However, as mentioned below in the "Statistics" section there is another anomaly - a consistent get and "buffer is pinned count" for every row looked up in the inner table, so this might not be just coincidence but another indicator that there is really some excess work happening with Table Prefetching.

By the way - both anomalies are still present in 11.1 / 11.2 when using Table Prefetching there.

End Update

Let's have a look at the session statistics.

Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
STAT..consistent gets 310,012 311,101 1,089
STAT..consistent gets from cache 310,012 311,101 1,089
STAT..session logical reads 310,012 311,101 1,089
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..buffer is not pinned count 200,001 99,997 -100,004
STAT..buffer is pinned count 189,998 290,006 100,008
STAT..consistent gets - examination 300,001 100,007 -199,994 work - consistent read gets 10,001 211,084 201,083
LATCH.cache buffers chains 320,031 522,195 202,164

So the only significant difference in this case is the increased "buffer is pinned count" / decreased "buffer is not pinned" count statistics, although the number of logical I/O stays the same. I don't know if this really means excess work with Table Prefetching enabled or whether this is an instrumentation problem. Nevertheless with Table Prefetching enabled in this case you'll end up with both a "buffer is pinned count" and "consistent get" for each row looked up in the inner row source table operation. The number of logical I/O and latch acquisitions stays the same, so it's not obvious from the statistics why this performs better than the non-Table Prefetching case - according to the statistics it even performs more work, but may be the table random access as parent operation to the Nested Loop allows a more efficient processing requiring less CPU cycles. Table Prefetching - Same (Random) order

Let's change the data order and use either the same "Pseudo-Random" order (by uncommenting the second "dbms_random.seed(0)" call) or order by ID - it doesn't matter with Table Prefetching in 10g.

Inner row source Unique Index - T1 and T2 same order

| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:03.91 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.70 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.54 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.14 | 200K|

Inner row source Non-Unique Index - T1 and T2 same order

| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:04.54 | 221K|
| 2 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 1 | 2 (0)| 100K|00:00:03.90 | 221K|
| 3 | NESTED LOOPS | | 1 | 100K| 202K (1)| 200K|00:00:02.82 | 211K|
| 4 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.40 | 201K|

Now we really see a difference: The unique index variant still shows the same results, but the non-unique variant saves logical I/O on the random table access - and is faster than with random order - coming closer to the unique index variant performance.

Whereas the index range scan still requires approx. 200,000 logical I/Os the random table access only requires 10,000 logical I/Os instead of 100,000.

The session statistics:

Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
LATCH.cache buffers chains 320,023 342,213 22,190
STAT..consistent gets 310,012 221,110 -88,902
STAT..consistent gets from cache 310,012 221,110 -88,902
STAT..session logical reads 310,012 221,110 -88,902
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000 work - consistent read gets 10,001 121,093 111,092
STAT..buffer is not pinned count 200,001 10,006 -189,995
STAT..buffer is pinned count 189,998 379,997 189,999
STAT..consistent gets - examination 300,001 100,007 -199,994

The session statistics confirm this: The "buffer is pinned count" increases by another 90,000 for the non-unique index variant which corresponds to the 90,000 logical I/Os performed less as part of the random table access operation.

The number of latch acquisitions decreases accordingly so that we end up with a comparable number as with the unique index variant.


If you run the non-unique index Table Prefetching variant with the concurrent execution test harness you'll see a corresponding slightly increased scalability although it still scales not as good as the unique index variant.


Table Prefetching has been introduced in Oracle 9i in order to optimize the random physical access in Nested Loop Joins, however it also seems to have a positive effect on logical I/O. The effectiveness of this optimization depends on the data order - if the data from the driving row source is in the same order as the inner row source table buffers can be kept pinned. Note that the same doesn't apply to the index lookup - even if the data is ordered by ID and consequently the same index branch and leaf blocks will be accessed repeatedly with each iteration, a buffer pinning optimization could not be observed.

In the next part we'll see what happens with this example in Oracle 11g and its new features.


I’ve been on holiday, walking in the Lake District, for a few days, so you haven’t seen much from me for a while (apart from a couple of post-dated items).

Now that I’m back I’ve got several hundred email messages, a couple of jobs, and a load of writing to catch up on, so you’re still not going to see much output. However, I have a of odds and ends that I’ve filed for later use, and I started tidying this list up a few weeks ago, so I thought I’d share the tidy bit with you – just to make sure you have some interesting reading over the next few days.

It’s published as a WordPress “page” rather than a post, I’ll probably post a new temporary link to it from time to time so that people can see when it gets updated. I’ve also added it to the “Special Links” list in the right-hand column.

Difference between Select Any Dictionary and Select_Catalog_Role

When you want to give a user the privilege to select from data dictionary and dynamic performance views such as V$DATAFILE, you have two options:

grant select any dictionary to ;
grant select_catalog_role to ;

Did you ever wonder why there are two options for accomplishing the same objective? Is one of them redundant? Won't it make sense for Oracle to have just one privilege? And, most important, do these two privileges produce the same result?

The short answer to the last question is -- no; these two do not produce the same result. Since they are fundamentally different, there is a place of each of these. One is not a replacement for the other. In this blog I will explain the subtle but important differences between the two seemingly similar privileges and how to use them properly.

Create the Test Case

First let me demonstrate the effects by a small example. Create two users called SCR and SAD:

SQL> create user scr identified by scr;
SQL> create user sad identified by sad;

Grant the necessary privileges to these users, taking care to grant a different one to each user.

SQL> grant create session, select any dictionary to sad;

Grant succeeded.

SQL> grant create session, select_catalog_role to scr;

Grant succeeded.
Let's test to make sure these privileges work as expected:

SQL> connect sad/sad


SQL> select * from v$session;
 ... a bunch of rows come here ...

SQL> connect scr/scr
SQL> select * from v$datafile;
 ... a bunch of rows come here ...
Both users have the privilege to select from the dictionary views as we expected. So, what is the difference between these two privileges? To understand that, let's create a procedure on the dictionary tables/views on each schema. Since we will create the same procedure twice, let's first create a script which we will call p.sql. Here is the script:
create or replace procedure p as
    l_num number;
    select count(1)
    into l_num
    from v$session;
The procedure is very simple; it merely counts the number of connected sessions by querying V$SESSION. When you connect as SAD and create the procedure by executing p.sql:

SQL> @p.sql

Procedure created.

The procedure was created properly; but when you connect as SCR and execute the script:

SQL> @p.sql

Warning: Procedure created with compilation errors.

SQL> show error
Errors for PROCEDURE P:

-------- ------------------------------------------------
4/2      PL/SQL: SQL Statement ignored
6/7      PL/SQL: ORA-00942: table or view does not exist

That must be perplexing. We just saw that the user has the privilege to select from the V$SESSION view. You can double check that by selecting from the view one more time. So, why did it report ORA-942: table does not exist?

Not All Privileges have been Created Equal

The answer lies in the way Oracle performs compilations. To compile a code with a named object, the user must have been granted privileges by direct grants; not through the roles. Selecting or performing DML statements do not care how the privileges were received. The SQL will work as long as the privileges are there. The privilege SELECT ANY DICTIONARY is a system privilege, similar to create session or unlimited tablespace. This is why the user SAD, which had the system privilege, could successfully compile the   procedure P.

The user SCR had the role SELECT_CATALOG_ROLE, which allowed it to SELECT from V$SESSION but not to create the procedure. Remember, to create another object on the base object, the user must have the direct grant on the base object; not through a role. Since SCR had the role not the direct grant on V$DATAFILE, it can't compile the procedure.

So while both the privileges allow the users to select from v$datafile, the role does not allow the users to create objects; the system privilege does.

Why the Role?

Now that you know how the privileges are different, you maybe wondering why the role is even there. It seems that the system grant can do everything and there is no need for a role. Not quite.The role has a very different purpose. Roles provide privileges; but only when they are enabled. To see what roles are enabled in a session, use this query:

SQL> connect scr/oracle
SQL> select * from session_roles
  2  /


2 rows selected.

We see that two roles - SELECT_CATALOG_ROLE and HS_ADMIN_SELECT_ROLE - have been enabled in the session. The first one was granted to the user. The other one is granted to the first one; so that was also enabled.
Just because a role was granted to the user does not necessarily mean that the role would be enabled. The roles which are marked DEFAULT by the user will be enabled; the others will not be. Let's see that with an example. As SYS user, execute the following:
SQL> alter user scr default role none;

User altered.
Now connect as SCR user and see which roles have been enabled:
SQL> connect scr/oracle
SQL> select * from session_roles;

no rows selected
None of the roles have been enabled. Why? That's  because none of the roles are default for the user (effected by the alter user statement by SYS). At this point when you select from a dynamic performance view:
SQL> select * from v$datafile;
select * from v$datafile
ERROR at line 1:
ORA-00942: table or view does not exist
You will get this error because the role is not enabled, or active. Without the role the user does not have any privilege to select from the data dictionary or dynamic performance view. To enable the role, the user has to execute the SET ROLE command:

Role set.
Checking the enabled roles:
SQL> select * from session_roles;


2 rows selected.

Now the roles have been enabled. Since the roles are not default, the user must explicitly enable them using the SET ROLE command. This is a very important characteristic of the roles. We can control how the user will get the privilege. Merely granting a role to a user will not enable the role; the user's action is required and that can be done programmatically. In security conscious environments, you may want to take advantage of that property. A user does not always have the to have to privilege; but when needed it will be able to do so.

The SET ROLE command is an SQL*Plus command. To call it from SQL, use this:


   dbms_session.set_role ('SELECT_CATALOG_ROLE');


You can also set a password for the role. So it will be set only when the correct password is given;

SQL> alter role SELECT_CATALOG_ROLE identified by l       
  2  /

Role altered.

To set the role, you have to give the correct password:
SQL> set role SELECT_CATALOG_ROLE identified by l;

Role set.
If you give the wrong password:
SQL> set role SELECT_CATALOG_ROLE identified by fl
  2  /
set role SELECT_CATALOG_ROLE identified by fl
ERROR at line 1:
ORA-01979: missing or invalid password for role 'SELECT_CATALOG_ROLE'
You can also revoke the execute privilege on dbms_session from public. After that the user will not be able to use it to set the role. You can construct another wrapper procedure to call it. Inside the wrapper, you can have all sort of checks and balances to make sure the call is acceptable.
We will close this discussion with a tip. How do you know which roles are default? Simply use the following query:
  2  from dba_role_privs
  3  where GRANTEE = 'SCR';

GRANTED_ROLE                   DEF
------------------------------ ---

Thanks to Randolph Geist ( and Pavel Ruzicka ( for pointing out yet another important difference. SELECT ANY DICTIONARY allows select from all SYS owner tables such as TAB$, USER$, etc. This is not possible in the SELECT_CATALOG_ROLE. This difference may seem trivial; but is actually quite important in some cases. For instance, latest versions of Oracle do not show the password column from DBA_USERS; but the hashed password is visible in USER$ table. It's not possible to reverse engineer the password from the hash value; but it is possible to match it to a similar entry and guess the password. A user with the system privilege will be able to do that; but a user with the role will not be.


In this blog entry I started with a simple question - what is the difference between two seemingly similar privileges - SELECT ANY DICTIONARY and SELECT_CATALOG_ROLE. The former is a system privilege, which remains active throughout the sessions and allows the user to create stored objects on objects on which it has privileges as a result of the grant. The latter is not a system grant; it's a role which does not allow the grantee to build stored objects on the granted objects. The role can also be non-default which means the grantee must execute a set role or equivalent command to enable it. The role can also be password protected, if desired.

The core message you should get from this is that roles are different from privileges. Privileges allow you to build stored objects such as procedures on the objects on which the privilege is based. Roles do not.

Systemstate Dump warning

Whilst investigating the latest of our many library cache contention problems on 11.2, I made the fatal mistake of relying on my previous experience combined with a standard Oracle Support note describing how to diagnose such problems.

When the system was apparently hung (although the reality was that one session was holding a library cache mutex or latch that other sessions were waiting for - hopefully more on that later), I decided a Systemstate dump would be a good idea. (On a side-note, I often prefer Hanganalyze but I suspect that's more to do with me finding the resulting trace file easier to read.)

What a mistake. Particularly as this was a Production system and I'd managed to talk one of the clients good DBAs into helping me out! As soon as the statement was run, the system went into a much worse state and, to cut a medium but too exciting story short, we ended up having to restart everything.

To make matters worse, another good colleague reminded me that he had sent out a warning to people that he'd hit the same experience earlier, due to bug 11800959 - "A SYSTEMSTATE dump with level >= 10 in RAC env dumps unnecessary huge BUSY GLOBAL CACHE ELEMENTS". In the usual email flurry, I'd put that whole conversation to one side.

So please don't make the same mistake as I did and ignore the warning - tread carefully before you start taking Systemstate dumps on 11.2!

OT: Norway

Having had one of my best conference experiences ever this year in the company of many good Norwegians, this weeks news saddened me and it must have come as a hell of a shock to the country as a whole, not just those involved. Their horror is difficult to imagine.

I really don't have adequate words, frankly.

At the same time, it inspired me to hear ...

"We must – and will – meet terror with more democracy, not less. We must not lock up Norwegian society. That would be piling tragedy upon tragedy."

My heart goes out to those involved and I look forward to going back to Norway next year ...

Who Manages the Exadata Machine?

For organizations that just procured an Exadata machine, one of the big questions is bound to be about the group supporting it. Who should it be - the DBAs, Sys Admins, Network Admins, or some blend of multiple teams?

The conventional Oracle database system is a combination of multiple distinct components - servers, managed by system admins; storage units, managed by SAN admins; network components such as switches and routers, managed by network admins; and, of course, the database itself, managed by the DBAs. Exadata has all those components - servers, storage (as cell servers), infiniband network, ethernet network, flash disks, the whole nine yards; but packaged inside a single physical frame representing a single logical unit - a typical engineered system. (For a description of the components inside the Exadata system, please see my 4-part article series on Oracle Technology Network) None of these conventional technology groups posses the skillsets to the manage all these components. That leads to a difficult but important decision - how the organization should assign the operational responsibilities.


There are two choices for organizations to assign administrative responsibilities.

  1. Distributed - Have these individual groups manage the respective components, e.g. Sys Admins managing the Linux servers, the storage admins managing the storage cells, network admins managing the network components and finally DBAs managing the database and the cluster.
  2. Consolidated - Create a specialized group - Database Machine Administrator (DMA) and have one of these groups expand the skillset to include the other non-familiar areas.

Each option has its own pros and cons. Let's examine them and see if we can get the right fit for our specific case.

Distributed Management

Under this model each component of Exadata is managed as an independent entity by a group traditionally used to manage that type of infrastructure. For instance, the system admins would manage the Linux OS, overseeing all aspects of it such as creation of users to applying the patches and RPMs. The storage and database would be managed likewise by the specialist teams.

The benefit of this solution is its seeming simplicity - components are managed by their respective specialists without a need for advanced training. The only need for training is for storage, where the Exadata Storage Server commands are new and specific to Exadata.

While this approach seems a nobrainer on surface, it may not be so in reality. Exadata is not just something patched up from these components; it is an engineered system. There is a huge meaning behind that qualifier. These components are not designed to act alone; they are put together to make the entire structure a better database machine. And, note the stress here - not an application server, not a fileserver, not a mail server; not a general purpose server - but a database machine alone. This means the individual components - the compute nodes, the storage servers, the disks, the flashdisk cards and more - are tuned to achieve that overriding objective. Any incremental tuning in any specific component has to  be within the framework of the entire frame; otherwise it may fail to produce the desired result, or worse, produce undesirable result.

For instance the disks where the database resides are attached to the storage cell servers; not the database compute nodes. The cell servers, or Cells run Oracle Enterprise Linux, which is very similar to Red Hat Linux. Under this model of administration, the system admins are responsible for managing the operating system. A system admin looks at the host and determines that it is under tuned since the filesystem cache is very low. In a normal Linux system, that would have been a correct observation; but in Exadata, the database is in ASM and a filesystem cache is less important. On the other hand, the Cells need the memory to place the Storage Indexes on the disk contents. Placing a large filesystem cache not only produce nothing to help the filesystem; but actually hurt the performance for the paging of Storage Indexes.

This is just one example of how the engineered systems are closely interrelated. Assuming they are separate and assigning multiple groups with different skillsets may not work effectively.

Database Machine Administrator

This is leads to the other approach - making a single group responsible for the entire frame from storage to the database. The single group would be able to understand the impact of the changes in one component to the overall effectiveness of the rack and will be in a better position to plan and manage. The single role that performs the management of Exadata is known as Database Machine Administrator (DMA).

I can almost hear the questions firing off inside your brain. The most likely question probably is whether it is even possible to have a single skillset that encompasses storage, system, database and network.

Yes, it definitely is. Remember, the advantages of an engineered system do not stop at being a carefully coordinated individual components. Another advantage is the lack of controls in those components. There are less knobs to turn on each component in an Exadata system. Take for instance the Operating System. There are two types of servers - the compute nodes and the cells. In the cells, the activity performed by a system admin is severely limited - almost to the point of being none. On the compute nodes, the activities are limited as well. The only allowable activities are - setting up users, setting up email relays, possibly setting up an NFS mount and handful of more. This can easily be done by a non-expert. One does not have to a System Admin to manage the servers.

Consider storage, the other important component. Traditionally storage administrators perform critical functions such as adding disks, carving out LUNs, managing replication for DR and so on. These functions are irrelevant in Exadata. For instance, the disks are preallocated in Exadata, the LUNs are created at installation time, there is no replication since the DR is by Data Guard which at the Oracle database level. One need not be a storage expert to the perform the tasks in Exadata. Additionally the Storage Admins are experts in the specific brand of storage, e.g. EMC VMax or IBM XiV. In Exadata, the storage is different from all the other brands your storage admins may be managing. They have to learn about the Exadata storage anyway; so why not have someone else, specifically the DMA learn?

Consider Network. In Exadata the network components are very limited since it is only for the components inside the rack. This reduces the flexibility of the configuration compared to a regular general purpose network configuration. the special kind of hardware used in Exadata - Infiniband - requires some special skills which the network ops folks may have to learn anyway. So, why not the DMAs instead of them? Besides, Oracle already provides a lot of tools to manage this layer.

That leaves the most visible component - the database which is, after all, the heart and soul of Exadata. This layer is amenable to a considerable degree of tuning and the depth of skills in this layer is vital to managing Exadata effectively. Transferring the skills needed here to a non-DBA group or individual is difficult, if not impossible. This makes the DBA group the most natural choice for evolving into the DMA role after absorbing the relevant other skills. The other skills are not necessarily at par with the administrator of the respective components. For instance the DMA does not need to be a full scale Linux system admin; but just needs to know a few relevant concepts, commands and tools to perform the job well. Network management is Exadata is a fraction of the skills expected from a network admin. The storage management in cell servers are new to any group; so the DMA will find that as easy as any other group, if not easier.

By understanding the available knobs on all the constituent components of Exadata, the DMA can be better prepared to be an effective administrator of the Exadata system; not by divvying up the activities to individual groups which are generally autonomous. The advantages are particularly seen when troubleshooting or patching Exadata. Hence, I submit here for your consideration - a new role called DMA (Database Machine Administrator) for the management of Exadata. The role should have the following skillsets:

60% Database Administration
20% Cell Administration
15% Linux Administration
5% Miscellaneous (Infiniband, network, etc.)

I have written an article series on Oracle Technology Network - Linux for Oracle DBAs. This 5-part article series has all the commands an concepts the Oracle DBA should understand about Linux. I have also written a 4 part article series - Commanding Exadata - for DBAs to learn the 20% cell administration. With these two , you will have everything you need to be a DMA. Scroll down to the bottom of this page and click on "Collection of Some of My Very Popular Web Articles" to locate all these articles and more.


In this blog entry, I argued for creating  a single role to manage the Exadata system instead of multiple groups managing individual parts. Here are the reasons in a nutshell:

  1. Exadata is an engineered system where all the components play collaboratively instead of as islands. Managing them separately may be ineffective and detrimental.
  2. The support organizations of components such as Systems, storage, DBA, etc. in an organizations are designed with a generic purpose in mind. Exadata is not generic. Its management needs unprecedented close coordination among various groups which may be new to the organization and perhaps difficult to implement.
  3. The needed skillsets are mostly database centric; other components have very little to manage.
  4. These other skills are easy to add to the DBA skills making the natural transition to the DMA role.

Best of luck in becoming a DMA and implementing Exadata.

Toolbar Buttons Revisited…

My 9 year old nephew was making a Powerpoint slideshow today. I watched him regularly saving his slideshow using the menu and asked why he was not using the Save button, to which he replied, “Which one is it?”.

I looked at the toolbar and saw a button with a picture of a floppy disk. I don’t think he has ever seen a floppy disk in his life. I’m not surprised he didn’t associate this button with saving his slideshow.

I’ve seen articles suggesting that buttons with icons are not good for new users. This is especially true if the icons reference old technology they have never encountered.






date conversions

Dates are a surprising pain for correlating data an at the same time being understandable to the human reader.

For a generic form we can take seconds since 1970, on UNIX, with

$ date '+%s'

but then how to we convert it back? There is a neat trick for that, “date -d @”

$ date -d @1311372938  +%H:%M:%S

so in scripts we can load up the date into seconds, that can be subtracted to get deltas and at the same time converted easily to any date format :

date_secs=` date '+%s'`
date -d @$date_secs +%H:%M:%S



AWK takes a bit of a trick to convert seconds to human format:

# setup the UNIX command
d="date -d @"date_secs" +%H:%M:%S"
# run the command to getline and put it into "date_human_format"
d | getline date_human_format
# be sure and close the open descriptor


Dtrace will output nano seconds since 1970 in “walltimestamp” function, so just lop off the nanoseconds and you can convert it to date




Oracle  has the nice feature of returning the days along with fractional days when subtracting two dates, so it seems like just subtracting off 1970 should work and then multiplying by the seconds in a day

SELECT (SYSDATE - to_date('01-JAN-1970','DD-MON-YYYY')) * (86400) from dual;

But since Oracle can already subtract dates there is little need to do the conversion unless extracting the data to be used by an external consumer.

If Oracle is reading data from an external source in seconds since 1970 we can just reverse the process

select to_char((TO_DATE('01-JAN-1970','DD-MON-YYYY') + seconds_since_1970 / 86400 ),'DD-MON-YYYY') from dual;


SQL> select to_char((TO_DATE('01-JAN-1970','DD-MON-YYYY') +1311372938  / 86400 ),'DD-MON-YYYY') from dual;


SQL> SELECT (SYSDATE - to_date('01-JAN-1970','DD-MON-YYYY')) * (86400) from dual;




One issue on Oracle though, the variable type timestamp can be a bit frustrating to work with, but it can be cast into a date, for example:

select cast(begin_interval_time as date ) from DBA_HIST_SNAPSHOT;


UPDATE: just found this link which is pretty good at covering a wide range of languages and databases:

though the didn’t include AWK :)  – who uses AWK these days anyway?

Pythian Tools: Method R Profiler, MR Tools & MR Trace

Working with 100 talented database engineers is fun and there are lots going on — lots of exciting (and not so much) projects ongoing, huge amount of problems solved, mistakes made (and learned from), many unique (as well as routine) customer needs satisfied, many new (and old) methods applied, many good (and less so) tools [...]