Search

OakieTags

Who's online

There are currently 0 users and 35 guests online.

Recent comments

Affiliations

Oakies Blog Aggregator

Review: The Method-R Profiler

I was lucky enough to get my hands on the Method-R Profiler and had a test drive with it. So first of all what is it? As you might expect a profiler, a profiler for Oracle performance problems. The tool makes use of Oracle’s trace facilities. If you have a performance problem with the database …

Continue reading »

IOT Part 6(B) – OLTP Inserts into an IOT

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
<…………IOT6(A) – Bulk Insert slowed down

IOT 6 part A covered the sort of insert work used to create an IOT or push a large number of rows into it via an “insert into…select from” type format. ie Bulk Load.
In this, part B, I am going to test inserting a small number of rows (10,000) into a table already holding a million rows. Something more like an hourly load or the inserts that would occur over a working day – a more OLTP-like test

I’m going to do the tests against four things. A “non-insert” test to ensure my data creation is not taking a lot of resource {the Null test}, then against my three versions of test table – a normal heap table, my IOT and my Partitioned IOT. I run each test on each segment type three times, the minimum number needed to help identify variation that could invalidate the test.

You will probably be surprised by the results….

All three tables have the same column structure, the heap has indexes on the primary key and a unique constraint, the IOT and partitioned IOT are organized on a modified version of the Primary key and have the second index for the unique key. Below is the creation statement for the partitioned IOT (the IOT is the same without the partition clause). You can get full creation details from IOT part 2:

create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(1000) not null
,vc_2      varchar2(1000)
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110901 values less than (to_date('01-09-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20111001 values less than (to_date('01-10-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20111101 values less than (to_date('01-11-2011','DD-MM-YYYY'))
  tablespace data_01
,PARTITION RMTOP  VALUES LESS THAN (MAXVALUE)
  tablespace USERS
)

The three tests are:

  • Insert 10,000 records for random accounts, in time order for the same day. This would be like processing data for or during a “day”
  • Generating the 10,000 records in order of account, to better match the IOT structure, again for a single day. For any given account 0,1,2 or 3 records are created, to be a little more “real life”. This would be like loading a pre-processed set of data.
  • All 10,000 records are created for the one account, in datetime order, as an extreme example of inserting into an IOT data that is ordered to perfectly match the order of the IOT.

The code is written ‘badly’, ie a single record is prepared and inserted into the table at a time, within a loop of 10000 iterations. This is to avoid the bulk-processing that may benefit the heap table and also to better match the reality of an OLTP system. I do however only commit after all records are created for any given test, which is maybe a mistake but that’s what I did.

Now, before I go any further, does anyone want to guess by how much inserting into the heap table will be faster than inserting into the IOT (partitioned or otherwise)? Also, whether the partitioned IOT will be faster or slower than the IOT and what, if any, impact there will be by the ordering of data?

To allow you to think about that and not see the results by accident, I’ll stick in one of the insert harnesses here so you can see how I am doing it. Skip over if you just want the results

--
execute s_snap.my_snap(' finished non-insert test1')
--
-- Transaction_heap random data test
declare
v_num         number :=10000; -- number of accounts
v_offset      number :=-4;
v_str1        varchar2(60);
v_str2        varchar2(60);
v_tran_type   number(2);
v_tran_id     number(10);
v_acco_type   number(2);
v_acco_id     number(10);
v_cre_date    date;
v_vc_1        varchar2(1000);
v_vc_2        varchar2(1000);
v_date_1      date :=sysdate;
v_num_1       number(2) :=10;
v_num_2       number(2) :=15;
begin
v_str1:=dbms_random.string('U',60);
v_str2:=dbms_random.string('U',60);
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..10000 loop
    v_tran_type :=mod(1,3)+1;
    v_tran_id   := 1000000+i;
    v_acco_type := 5+(trunc(dbms_random.value(1,3))*5); -- 10 or 15
    v_acco_id   :=trunc(dbms_random.value(1,v_num/2)); -- 1 to 5000
    v_cre_date  :=trunc(sysdate-v_offset)+(i/(60*60*24));
    v_vc_1      :=substr(v_str1,1,51+mod(i,10));
    v_vc_2      :=substr(v_str2,1,46+mod(i,15));
    insert into transaction_heap
      (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,vc_2,date_1,num_1,num_2)
    values
      (v_tran_type,v_tran_id,v_ACCO_type,v_ACCO_id,v_cre_date
      ,V_vc_1,v_vc_2,v_date_1,v_num_1,v_num_2);
end loop;
commit;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
--
execute s_snap.my_snap(' finished th insert test1')
--

Here are the results of three runs for all tests:

Elapsed time in seconds:

                    random insert             ordered-by-account       perfect IOT order match
Segment type    Test 1    2    3            Test 1    2    3           Test 1    2     3    
------------    ------ ---- ----            ------ ---- ----            ----- ---- ----
No insert          2.1  0.1  0.1               0.1  0.1  0.1              0.1  0.0  0.1   

                                    AVG                         AVG                        AVG
                                   ----                        ----                       ----
Heap              52.3 20.5 11.4   28.1       33.4 13.8 10.7   19.3       9.1 13.0  9.8   10.6
IOT              137.1 83.7 26.4   82.4      110.8 63.6 39.0   71.2       6.4  6.7  4.0    5.8
Partn'd IOT       33.2  9.5  8.7   17.1       17.1 10.4 14.3   13.9       8.1  5.3  4.5    6.0
                                   ----                        ----                       ----

Allow me to summarise that.

First, the preparation of the data and cycling around the loop, the no-insert tests, are negligible. {The first no_insert test is taking 2 seconds due to the parsing and collecting dictionary information overhead of the first thing done}. We can ignore the overhead of the test harness.

Much more significantly:

  • Each iteration of each test, things get faster {with one odd exception}
  • For Random and semi-ordered inserts, inserting into the IOT is slower than the Heap by a factor of 300% to 400% (3 times to 4 times slower) – what we all probably expected.
  • Inserting into the IOT is Faster than the heap when the insert order matches the structure of the IOT in the perfect_IOT test. Not just the same but 40% faster
  • The partitioned IOT is as Fast or Faster than the the Heap table, never mind the IOT. Over all the partitioned IOT takes 50% TO 75% THE TIME OF THE HEAP to be populated.

Hands up everyone expecting this? NB My hand would be down if I had not done similar tests before.

Why do we see these results?

I’ll just mention that the reason all the tests get quicker with each iteration is that the computer is doing the same thing over and over again and gets used to it, the relevant circuits in the chip are “oiling in”. Of course not, how silly – but it seems to be how non-IT people think of it :-) Basically more and more of each segment is getting cached and that reduces physical IO and speed things up. You see this a lot in testing and you need to be constantly vigilant about it. You might think I should really run the tests until they are steady-state, but that is a very false situation. On real system you don’t tend to get the same thing run over and over and over again on a system with nothing else running.

Much more interestingly, why is the partitioned IOT winning? Well, for the Heap table data is being poured into the leading edge of the table, which takes a little processing and some blocks to be created. However, there are also two indexes being maintained, the Primary Key and that Unique key. For each insert oracle has to find the right spot in each index (root block ->branch block->leaf block->check for row) and insert the new record, occasionally splitting a full leaf block.

For the IOT and the Partitioned IOT there is no table to maintain. In effect there are just two indexes to maintain. For the normal IOT this benefit is outweighed by the cost of working with a less space efficient Primary Key index. As there is the whole of the row in each index entry there are fewer index entries possible per leaf block and so more block need to be read in from disc and more splits are needed. Thus the inserts into the IOT are slower.
But for the partitioned IOT not only is there no table to maintain but the table (really the Primary Key) is partitioned – on date. We are creating data for a single day at a time. All the relevant blocks will be in the one partition (and it’s the latest and smallest partition) and quickly all cached in memory and processed in memory. It’s the efficient use of the cache again. For both the IOT and the Heap, the primary key segment being processed is larger and the inserts are spread across the whole index.

Ordering data to match the index order (ordered_by_account tests) helps for all three segment types – heap, IOT and partitioned IOT. We have two indexes and the data is being ordered for both in effect, as one is on the columns we are ordering the data creation for and the other, the unique key, is generated by an ascending sequence. The gain is about 25, 30% and is due to working through the indexes in order as opposed to skipping about. If the indexes were significantly larger than the buffer cache or on a highly active system where the turnover in the cache was very high (indeicated by a low buffer cache hit ratio :-) ) then the benefit would be higher.
The perfect ordering test (perfect_iot_order_match) is a special case. Not only is the data being processed in order of the indexes but only one part of the index is being updated – for a single account. In this situation the maintaining of the index is so efficient that the time saved not processing the heap table means both the IOT and IOT_P win out over the heap. So in this special case, inserts into IOTs are faster than inserts into a heap table.

And finally a quick summary:

  • Inserting records into an IOT carries about a 200-400% overhead, depending on the number of indexes
  • Ordering the data to match the IOT gives a 25-30% advantage, but it similarly aids update of the primary key on the heap
  • Partitioning the IOT such that a smaller chunk of the index is processed can be highly beneficial. I don’t cover it specifically in the above but you can probably appreciate that partitioning on a heap can be beneficial to processing indexes on the heap also.
  • If you can organise to create your IOT such that data is created only in one or a few distinct points in the structure (you might have to tweak you data load too of course), insert performance might match or even beat that for a heap table with the same indexes
  • Using methods to maximise efficient use of your buffer cache and reduce IO is a very powerful performance tuning aid, not just for select

BTW I am conscious of the fact that the above may surprise many and you might wonder if my data or tests are flawed. I have stripped out most of the code from this post as it was just becoming massive and confusing, but contact me and I will happily supply all the example scripts and data.

EMlight

I’ve been meaning to blog about EMlight (http://obzora.com) for a while.  I first tried it about 6 months ago, and was impressed.  EMlight  is a light weight  re-write of Enterprise Manager that is dependable and improved. How is it improved? Let’s look at my favorite part, the “Top Activity” screen. The “Top Activity” screen has added a section on I/O latencies which is sorely missing from OEM’s performance page and Top Activity page. The OEM light version also includes  a list of the top events (as I did in DB Optimizer).  Very cool.

How is EMlight more dependable? With EMlight you just connect to the database and voila it works. It doesn’t require any of the brittle configuration files that OEM depends on nor the agents and processes that OEM depends on.

Here is an example of the To Activity screen in EMlight:

I tested version 1.2. Since then versions 1.3 and 1.4 have been released. I’ve meant to test these but haven’t yet and didn’t want to wait any longer before posting about this promising tool.

Just gave a quick look at 1.4 and looks pretty much the same but obvious improvements such as the “top activity” screen above now has a super responsive slider window to choose the aggregation time range, and you might notice in the above screen shot that CPU is missing in two of the 3 aggregates which is now fixed in 1.4 (I think the fix actually made it into 1.3 originally)

Check out more screen shots here http://obzora.com/screenshots.html

 

Here are a few other screen shots fro my testing:

Datafiles along with single block I/O latency which is a nice addition

 

I/O along with I/O latency (I don’t think OEM has every added I/O latency )


Buffer cache contents and advisory:

 

 

Archive file date times sizes and rates

 

Tables space usage and attributes
Top  Session, PL/SQL, SQL, Events  (kind of like the top activity page without the timeline)

 

I really liked this one (the followup)

I recently wrote about an interesting question (not Oracle related!).  I found the question interesting because it reminds me of so many questions I actually receive about Oracle.  They are vague, incomplete, ambiguous, confusing at times.

The question I referred you to fell squarely in that category.  It was:

If you choose an answer to this question at random, what is the chance you will be correct?
a) 25%
b) 50%
c) 60%
d) 25%

They are missing the correct answer in the list!  And adding the correct answer would change the answer - so that it might be incorrect as well.  Meaning - there is more missing here than supplied.  You don't stand a chance of answering the question at all.

I put forth that the answer is "e) none of the above" - making the right answer "20%".

Looking at the problem, if you assume the right answer is 25% - because you have a 1 in 4 chance of selecting any one of the items - the probability you will get the right answer randomly is 50%.  So, it cannot be 25% - if it were - it would be 50% (confusing, yes).

Moreover, if you assume 50% is correct, then the right answer is 25% - because the chance of picking 50% is 1 in 4.

If you assume 60% is correct - well, I don't know what to say.

The question itself seems to somehow affect the answer - the act of answering the question changes the answer.  Sort of like when you run a program in the debugger and it works perfectly - but when you run it standalone it crashes (that has never happened to me, nope, not a chance...).

So, my answer is "e) none of the above"

Fedora 16 and Oracle…

Fedora 16 came out yesterday and since it’s my main server OS it’s been upgrade crazy round here. All new installs and upgrades were straight forward. No real dramas at all (touch wood).

As usual, I’ve done the OS installation and Oracle installation articles.

Remember, installing Oracle on Fedora is just for fun. There is no real need to do it because you can use Oracle Linux for free and the latest version of VirtualBox has a Fedora 16 rpm, so there is no reason not to use Oracle Linux.

Cheers

Tim…




Profiling trace files with preprocessor external tables in 11g

Generate and read TKProf and OraSRP reports in a single SQL*Plus window (or other IDE of choice) using preprocessor external tables. November 2011

Importing AWR repositories from cloned databases

I sometimes want to import AWR repositories from cloned databases to compare performance differences between the databases which should have the same performance characteristics. Unfortunately the clones have the same DBID this importing their AWR repositories into a central AWR repository would be probelematic as I wouldn’t have a way to tell which database the data came from as the DBID is the same. So here is a procedure to change the DBID. THe procedure requires importing the data into the awr_stage user, changing the DBID, then moving the data into the AWR repository:
(see http://gavinsoorma.com/2009/07/exporting-and-importing-awr-snapshot-data/ for info on normal exporting and importing of AWR repositories)

The script uses a  awr_change_dbid.sql to defer the constraint on DBID, change the DBID and then put the constraint back

   create tablespace AWR datafile '/home/oracle/oradata/AWR_01.dbf' size 200M;
   Drop Directory AWR_DMP;
   Create Directory AWR_DMP AS '/home/oracle/awr_dumps';
-- create staging user user
   drop user awr_stage cascade;
   create user awr_stage
     identified by awr_stage
     default tablespace awr
     temporary tablespace temp;
   grant connect to awr_stage;
   alter user awr_stage quota unlimited on awr;
   alter user awr_stage temporary tablespace temp;
-- load data
   begin
     dbms_swrf_internal.awr_load(schname  => 'AWR_STAGE',
 				 dmpfile  => 'my_awr_dump', -- file w/o .dmp extension
                                 dmpdir   => 'AWR_DMP');
   end;
/
-- change dbid
   @awr_change_dbid
   commit;
-- move data
   def schema_name='AWR_STAGE'
   select  '&schema_name' from dual;
   variable schname varchar2(30);
   begin
     :schname := '&schema_name';
     dbms_swrf_internal.move_to_awr(schname => :schname);
   end;
/
-- verify the new DBID in repository
   col host_name for a30
   select distinct dbid,  db_name, instance_name, host_name from
   dba_hist_database_instance;

 

IOT P6(a) Update

In my last post, IOT part 6, inserts and updates slowed down, I made the point that IOT insert performance on a relatively small Oracle system was very slow, much slower than on a larger system I had used for professional testing. A major contributing factor was that the insert was working on the whole of the IOT as data was created. The block buffer cache was not large enough to hold the whole working set (in this case the whole IOT) once it grew beyond a certain size. Once it no longer fitted in memory, Oracle had to push blocks out of the cache and then read them back in next time they were needed, resulting in escalating physical IO.

I’ve just done another test which backs up this claim. I altered my test database so that the block buffer cache was larger, 232MB compared to 100MB in my first tests. The full IOT is around 200MB

Bottom line, the creation of the IOT was greatly sped up (almost by a factor of 4) and the physical IO dropped significantly, by a factor of 20. As a result, the creation of the IOT was almost as fast as the partitioned IOT. It also shows that the true overhead on insert of using an IOT is more like a factor of 2 to 4 as opposed 6 to 8.

You can see some of the details below. Just to help you understand them, it is worth noting that I had added one new, larger column to the test tables (to help future tests) so the final segments were a little larger (the IOT now being 210MB as opposed to 180MB in the first tests) and there was a little more block splitting.

                        Time in Seconds
Object type           Run with       Run with
                     100MB cache    232MB cache
------------------  ------------    -----------   
Normal Heap table          171.9          119.4   
IOT table                1,483.8          451.4     
Partitioned IOT            341.1          422.6 

-- First reading 100MB cache
-- second reading 232MB cache 
STAT_NAME                            Heap    	IOT	      IOT P
-------------------------------- ---------- -----------  ----------
CPU used by this session            5,716         7,222       6,241
                                    5,498         5,967       6,207

DB time                            17,311       148,866      34,120
                                   11,991        45,459      42,320

branch node splits                     25            76          65
                                       25            82         107

leaf node 90-10 splits                752         1,463       1,466
                                      774         1,465       1,465

leaf node splits                    8,127        24,870      28,841
                                    8,162        30,175      40,678

session logical reads           6,065,365     6,422,071   6,430,281
                                6,150,371     6,544,295   6,709.679

physical read IO requests             123        81,458       3,068
                                      36          4,012       1,959

physical read bytes             2,097,152   668,491,776  25,133,056
                                1,400,832    34,037,760  16,048,128

user I/O wait time                    454       139,585      22,253
                                       39        34,510      19,293

The heap table creation was faster with more memory available. I’m not really sure why, the cpu effort was about the same as before and though there was some reduction in physical IO with the larger cache, I suspect it might be more to do with both the DB and the machine having been recently restarted.

All three tests are doing a little more “work” in the second run due to that extra column and thus slightly fewer rows fitting in each block (more branch node and leaf node splits), but this just highlights even more how much the IOT performance has improved, which correlates with a massive drop in physical IO for the IOT creation. If you check the session logical reads they are increased by a very small, consistent amount. Physical read IO requests have dropped significantly and, in the case of the IOT, plummeted.

I believe the 90:10 leaf node splits are consistent as that will be the maintaining of the secondary index on ACCO_TYPE and ACCO_ID, which are populated in order as the data is created (derived from rownum).

What this second test really shows is that the efficiency with which you are able to make use of the database cache is incredibly significant. Efficiently accessing data via good indexes or tricks like IOTs and hash tables is important but it really helps to also try and consider how data is going to be recycled within the cache or used, pushed out and then reused. A general principle for batch-type work seems to me to be that if you can process it in chunks that can sit in memory, rather than the whole working set, there are benefits to be gained. Of course, partitioning can really help with this.

{If anyone is wondering why, for the heap table, the number of physical IO requests has dropped by 70% but the actual number of bytes has dropped by only 30%, I’m going to point the finger to some multi-block read scan going on, either in recursive code or, more likely, my code that actually gathers those stats! That would also help explain the drop in user IO wait time for the heap run.}

Just for completeness, here is a quick check of my SGA components for the latest tests, just to show I am using the cache size I claim. All of this is on Oracle 11.1 enterprise edition, on a tired old Windows laptop. {NB new laptop arrived today – you have no idea how hard it has been to keep doing this blog and not play with the new toy!!!}. If anyone wants the test scripts in full, send me a quick email and I’ll provide them.:

-- sga_info.sql
-- Martin Widlake /08
-- summary
set pages 32
set pause on
col bytes form 999,999,999,999,999 head byts___g___m___k___b
spool sga_info.lst
select * 
from v$sgainfo
order by name
/
spool off
clear col
--
NAME                             byts___g___m___k___b RES
-------------------------------- -------------------- ---
Buffer Cache Size                         243,269,632 Yes
Fixed SGA Size                              1,374,892 No
Free SGA Memory Available                           0
Granule Size                                4,194,304 No
Java Pool Size                              4,194,304 Yes
Large Pool Size                             4,194,304 Yes
Maximum SGA Size                          401,743,872 No
Redo Buffers                                6,103,040 No
Shared IO Pool Size                                 0 Yes
Shared Pool Size                          142,606,336 Yes
Startup overhead in Shared Pool            50,331,648 No
Streams Pool Size                                   0 Yes

Troubleshooting ‘DFS lock handle’ waits

Waits for ‘DFS lock handle’ can cause massive performance issues in a busy RAC cluster. In this blog entry, we will explore the DFS lock handle wait event, and understand how to troubleshoot the root cause of these waits. I am also going to use locks and resources interchangeably in this blog, but internally, they are two different types of structures.

A little background

DFS (stands for Distributed File System) is an ancient name, associated with cluster file system operations, in a Lock manager supplied by vendors in Oracle Parallel Server Environment (prior name for RAC). But, this wait event has morphed and is now associated with waits irrelevant to database files also. Hence, it is imperative to understand the underlying details to debug the ‘DFS lock handle’ waits.

How does it work?

I have no access to the code, so read this paragraph with caution, as I may have misunderstood my test results: A process trying to acquire a lock on a global GES resource sends a AST(Asynchronous Trap) or BAST (Blocking Asynchronous Trap) message to LCK process, constructing the message with (lock pointer, resource pointer, and resource name) information. If the resource is not available, then the LCK process sends a message to the lock holder for a lock downgrade.

Now, let’s imagine that we need to ask remote background process to do some operations, say, DBWR to do an object checkpoint in a remote instance. Further, the requestor must wait for the background process to complete the task. What better way is there to implement these waits, than using a lock? It seems that acquiring a lock on a specific resource in a specific mode, triggers a predefined action in that background process. For example, acquiring locks on a specific CI resource triggers the DBWR to object level checkpoint (details later).

While the background process is working, the requesting process is instrumented to wait on the wait event ‘DFS lock handle’, in essence, the requesting process is waiting for one or more background process to perform a specific action. BTW, in some cases, background processes also can wait for ‘DFS lock handle’. Excessive waits by the foreground or background process for the DFS lock handle wait event can lead to instance freeze or application freeze.

Every resource has a value block and can be used to send more information about the resource, such as object_id for a sequence, object_id for a table partition etc.

Type of Lock

First, we need to identify the type of lock associated with this event. Fortunately, type|event in the trace file (or p1 column in v$session when the event waited is DFS lock handle) has this information coded and can be retrieved using the following SQL statement.


From a trace file:
nam='DFS lock handle' ela= 4362 type|mode=1398145029 id1=86033 id2=0 obj#=-1 tim=23065806692

SQL> select chr(bitand(&&p1,-16777216)/16777215) ||
chr(bitand(&&p1,16711680)/65535) type,
mod(&&p1, 16) md
from dual
/
TY MD
-- ----------
SV 5

Also
1398145029 = 0x53560005; 53 is ASCII for S and 56 is ASCII for V.

Now, we know that, lock type is SV and we can get more details about this lock type accessing v$lock_type view. SV enqueue is acquired to co-ordinate the retrieval values from sequences created with order and cache. There are many other enqueue types though, this is just an example


SQL> exec rs.print_table
('select * from v$lock_type where type=''SV''');
TYPE : SV
NAME : Sequence Ordering
ID1_TAG : object #
ID2_TAG : 0
IS_USER : NO
DESCRIPTION : Lock to ensure ordered sequence allocation in RAC mode
-----------------

Resource name

In Global Enqueue Services layer, each resource is uniquely identified by a resource name. For example, for the sequence with object_id=86033, resource name is [0x15011][0x0],[SV][ext 0x0,0x0]. As you see, this resource name is essentially, [id1][id2],[lock type][extension]. In this example, id1=86033=0×15011. Also, p2, p3 columns in v$session for this DFS lock handle wait event is enumerating id1 and id2 columns of the resource. As you probably guessed, many instance level locks are not visible in gv$lock family of views, of course, gv$lock is a global view of local locks and gv$ges_enqueue is global view of global locks :-)

Let’s probe further to understand few lock types commonly encountered.

Enqueue: SV

If the enqueue type is SV, then that enqueue is used to co-ordinate sequence values in a RAC cluster, for sequences declared with order and cache attributes. For the SV locks, id1 column will give you the object# of the sequence. In the example line below


nam='DFS lock handle' ela= 4362 type|mode=1398145029 id1=86033 id2=0 obj#=-1 tim=23065806692

id1 is set to 86033. That’s the object_id of a sequence declared with cache and order. Sequences with these attributes require the values to be in ORDER. As we know that in RAC, it is likely to have sequences return values in an order not correlated with time, due to the fact that each instance will cache sequence values. With ORDER attribute, we specifically require these values to be in a strict ascending/descending order even if the sequences are accessed in different instances. To accommodate this requirement, RAC code must ensure that next value retrieved from the sequence will be in an ORDER in any instance and the co-ordination mechanism is by transferring couple of GES messages between the instances. Master node of the GES resource for that sequence object will keep track of the latest value. It looks like, Value block of the GES resource is updated to reflect the latest sequence value.

You might be wondering what if there is more than a sequence. How is it handled? Each sequence gets its own resource. For example, the resource name for sequence with object_id=86033 will be [0x15011][0x0],[SV] and another sequence with object_id=86034 will be maintained by a resource [0x15012][0x0],[SV] etc. Value block of these resources will contain the latest value of the sequence and kept in the master node of this resource.( As a shameless plug, I go in to much details about GES and GRD in my Advanced RAC troubleshooting class. )

Of course, if the wait event is DFS lock handle, and if the enqueue is SV type, then the solution probably would be that, ORDER attribute is not a suitable attribute for that heavily accessed sequence. My recommendation is always been (a) Use bigger cache >1000 for the sequences frequently accessed, hundreds per second (b) Use default cache of 20 if the sequence is not frequently accessed (c) if you really require ORDERed values, know that this could be a problem in RAC, and use ORDER cache (d) Only in exceptional situations use nocache.

Enqueue: CI

CI stands for Cross Invocation, and mainly used to trigger an action in a background process such as DBWR, CKPT process etc, in a remote instance. id1 and id2 columns of this CI enqueue has special significance. Depending upon the values of these id1,id2 fields a specific resource is manipulated to trigger a specific action in a background process.

For example, truncating a table leads to foreground process marking a segment as temporary and requesting the background process to clean up the segments from their cache. Essentially, a co-ordination mechanism to ensure that all buffers about a segment is flushed out by DBW. This operation is triggered by acquiring a GES resource of CI type in a specific format: [0xd][0x2],[CI][ext 0x0,0x0]. From the GES resource name [0xd][0x2],[CI], we can decode that lock name is CI, id1=0xd=13, id2=0×2=2 (CI-13-2).

You can see the process holding this specific resource in gv$ges_enqueue;


select resource_name1 ,inst_id, pid,blocker, blocked from gv$ges_enqueue where resource_name1 like '%[0xd][0x2],[CI]%' and pid!=0
SQL> /

RESOURCE_NAME1 INST_ID PID BLOCKER BLOCKED
------------------------------ ---------- ---------- ---------- ----------
[0xd][0x2],[CI][ext 0x0,0x0] 1 1481 0 0
[0xd][0x2],[CI][ext 0x0,0x0] 2 1486 0 0

In this case, PID 1486 is DBW0
$ ps -ef |grep 1486|grep -v grep
oracle 1486 1 0 14:23:36 ? 0:02 ora_dbw0_solrac2

In fact, similar resource with a different id2 combinations are accessed prior to waiting for id2=2. For example, resources, [0xd][0x1],[CI], [0xd][0x3],[CI] are accessed prior to acquiring the resource [0xd][0x2],[CI] for a partition. Value block of the resource [0xd][0x1],[CI] is used to send the object_id of the segments. First two resources ([0xd][0x1],[CI], [0xd][0x3],[CI] ) are accessed as AST(Asynchronous Trap), and the resource [0xd][0x2],[CI] is accessed with BAST (Blocking Asynchronous Trap) per partition (or segment).

If you encounter high number of waits for ‘DFS lock handle’ with CI enqueue, then find the object_id of the event. Find the object details from the object_id, and understand the activity on that segment. Of course, excessive truncate and DDL statement on objects can cause issues in RAC due to global lock activity.

There are few more examples of CI locks. For example, starting parallel query servers in an inter-instance parallelism will trigger few access to CI locks. Three access to the resource to allocate one parallel slave is printed below: Essentially, resource is [0xa][0x1],[CI] for the first wait below.


nam='DFS lock handle' ela= 554 type|mode=1128857605 id1=10 id2=1 obj#=79568
nam='DFS lock handle' ela= 523 type|mode=1128857605 id1=10 id2=3 obj#=79568
nam='DFS lock handle' ela= 129970 type|mode=1128857605 id1=10 id2=2 obj#=79568

Note 34631.1 documents some of the meaning for id1 and id2 column. For example, id1=13 “drop sort segments”. I am not certain this note is accurate enough for 11.2 though.

Enqueue: IV

IV enqueues are acquired for Invalidation Lock. When a DDL is encountered on an object, then the dependent objects need to be invalidated. For example, cursors (SQL statements) that have dependency on a table need to be invalidated so that next execution of that cursor will reparse the cursor, after a DDL on that table. (Of course, I do understand that not all DDL statements will invalidate the cursors).

only the permanent objects such as tables and indexes are globalised and the SQL statements need not be protected by global resources. SQL statements are in turn invalidated by the dependency mechanisms on the dependent objects. For example, if you alter a table to drop a column, then that DDL statement will trigger invalidation of dependent objects of that table from all instances, an indirect invalidation of cursors from all instances occurs. In some cases, library cache locks and pin waits can occur due to indirect dependency validation mechanisms.


nam='DFS lock handle' ela= 414 type|mode=1230372869 id1=1398361667 id2=59 obj#=82512 tim=11542441219

In the example above, object# 82512 is an associated with table partition. Essentially, all dependent object of that partition must be invalidated from all instances. There are potential bugs in this area, for example, IV waits are higher for shared server connections (bug 8215444)

Enqueue: BB

XA enabled applications can connect to more than one instance in RAC from 11.2 onwards. Prior to 11.2, XA transaction can only connect to just one instance of RAC. GTX processes implement these global transactions and due to bug 13003033, GTX process waits for BB enqueue longer, leading to more DFS lock handle waits in 11.2. Only workaround at this point is to disable global transactions (11.2.0.2).

Enqueue related to ASM

There are many ASM related operations that use DFS lock handles to co-ordinate the operations. For example, ASM map operations when a file extended or added will trigger actions in the other nodes via DFS lock handle mechanism.


Here is the list of ASM based waits:
38 ASM diskgroup discovery wait
39 ASM diskgroup release
40 ASM push DB updates
41 ASM add ACD chunk
42 ASM map resize message
43 ASM map lock message
44 ASM map unlock message (phase 1)
45 ASM map unlock message (phase 2)
46 ASM generate add disk redo marker
47 ASM check of PST validity
48 ASM offline disk CIC

Summary

DFS lock handle is a wait event is associated with many different areas of database software code. It is important to identify the lock type, id1,id2 and understand the problem to resolve the root cause. Of course, many of these issues can be a new or already-identified software bugs too.

Installing FreeNX on OpenSuSE 11.4

After reading an article in one of my favourite computer magazines about FreeNX and NoMashine’s NX I was very interested to get this to work. Also, google are using NX for some developers-and if a technology is good enough for google than it can only be good enough for me as well.

Unfortunately there wasn’t an awful lot of documentation around for openSuSE 11.4 but by making best use of search engines I finally got it to work. Again, this is looking rather trivial in the post but was a lot of work finding out! Now here’s what I did.

Installing FreeNX

I found the RPMs for FreeNX in the standard SuSE repositories, but there are probably newer builds to be found here:

http://download.opensuse.org/pub/opensuse/repositories/X11:/RemoteDesktop/openSUSE_11.4/

It seems to be sufficient to install these packages on the server:

#  rpm -qa|grep -i nx
NX-3.4.0-21.1.x86_64
FreeNX-0.7.2-29.1.x86_64

Instead of a post-install action as part of the RPM installation you have to take action to configure the nx server. First, you need to create this directory tree if you haven’t got cups installed on the server (FreeNX makes use of CUPS for printer sharing).

# mkdir -p /usr/lib64/cups/backend/ipp

Now you should be able to run the configuration. One thing you will notice is a prompt to generate a SSH key. I would suggest you let it generate the key rather than using the very well-known NoMachine key. Since FreeNX relies on SSH to get access to the box it’s quite secure, theoretically. The key generated during the installation will be used to authenticate client sessions. Here is the output from the configuration session on my host with some detail removed:

nxhost:~ # nxsetup --install  --clean --purge
------> It is recommended that you use the NoMachine key for
easier setup. If you answer "y", FreeNX creates a custom
KeyPair and expects you to setup your clients manually.
"N" is default and uses the NoMachine key for installation.

Do you want to use your own custom KeyPair? [y/N] y
Removing special user "nx" ...no crontab for nx
done
Removing session database ...done
Removing logfile ...done
Removing home directory of special user "nx" ...done
Removing configuration files ...done
Setting up /etc/nxserver ...done
Generating public/private dsa key pair.
Your identification has been saved in /etc/nxserver/users.id_dsa.
Your public key has been saved in /etc/nxserver/users.id_dsa.pub.
The key fingerprint is:
f0:70:91:36:ed:asdsadsadsadsad9:21:7e:d1:9a root@nxhost
The key's randomart image is:
+--[ DSA 1024]----+
|o+.. o  .o       |
...
|                 |
+-----------------+
Setting up /var/lib/nxserver/db ...done
Setting up /var/log/nxserver.log ...done
Setting up special user "nx" ...done
Adding user "nx" to group "utmp" ...done
Setting up known_hosts and authorized_keys2 ...Unique key generated; your users must install

/var/lib/nxserver/home/.ssh/client.id_dsa.key

on their computers.
done
Setting up permissions ...done
Setting up cups nxipp backend ...done

----> Testing your nxserver configuration ...
Warning: Invalid value "COMMAND_FOOMATIC=/usr/lib64/cups/driver/foomatic-ppdfile"
Users will not be able to use foomatic.
Warning: "/usr/lib64/cups/backend/smb" is not executable.
Users will not be able to enable printing.
Warning: Invalid value "CUPS_ETC=/etc/cups/"
Users will not be able to enable printing.
Warning: Invalid value "COMMAND_START_KDE=startkde"
Users will not be able to request a KDE session.
Warning: Invalid value "COMMAND_START_CDE=cdwm"
Users will not be able to request a CDE session.
Warning: Invalid value "COMMAND_SMBMOUNT=smbmount". You'll not be able to use SAMBA.
Warning: Invalid value "COMMAND_SMBUMOUNT=smbumount". You'll not be able to use SAMBA.
Warning: Invalid cupsd version of "/usr/sbin/cupsd". Need version 1.2.
Users will not be able to enable printing.

Warnings occured during config check.
To enable these features please correct the configuration file.

<---- done

----> Testing your nxserver connection ...
HELLO NXSERVER - Version 2.1.0-72 OS (GPL, using backend: 3.4.0)
<--- done

Ok, nxserver is ready.

PAM authentication enabled:
All users will be able to login with their normal passwords.

PAM authentication will be done through SSH.
Please ensure that SSHD on localhost accepts password authentication.

You can change this behaviour in the /etc/nxserver/node.conf file.

Warning: Clients will not be able to login to this server with the standard key.
Please replace /usr/NX/share/client.id_dsa.key on all clients you want
to use with /var/lib/nxserver/home/.ssh/client.id_dsa.key
and protect it accordingly.

Since 1.5.0 you need to import the correct key via the GUI.

If you really want to use the NoMachine key please remove
'/var/lib/nxserver/home/.ssh/authorized_keys2'
and then run this script with the --setup-nomachine-key parameter.
Have Fun!

Now this looked promising! If problems are reported at this stage they are usually a) related to missing CUPS and b) due to sshd not listening at 127.0.0.1.

Configuring the node

All configuration work is done in /etc/nxserver/node.conf. I had to experiment quite a while to get this right and more or less secure. All users I allow in (really myself!) have to authenticate via username and password. I haven’t moved the SSH port to <> 22 but secured SSH so that I can only get in with a key file or openVPN. That should be enough I think, and except for a number of boring VMs for Oracle RAC there isn’t anything on the box anyway. To add users to the NX database, you can use these commands:

nxserver --adduser username
nxserver --passwd username

Replace username with the real username (obviously). Oh and you need to ensure that the keys are in authorized_keys2 in sshd_config such as shown here:

# grep authorized_keys2 /etc/ssh/sshd_config
AuthorizedKeysFile     .ssh/authorized_keys2

It’s a good idea a this point to rename any $HOME/.ssh/authorized_keys file to $HOME/.ssh/authorized_keys2 and test that you can get in before continuing. Whenever a change is made to the sshd_config, you need to reload the SSH daemon to make them active.

Next, edit node.conf and set these variables (some might need commenting out)

  • ENABLE_PASSDB_AUTHENTICATION=”0″
  • ENABLE_SSH_AUTHENTICATION=”0″
  • ENABLE_SU_AUTHENTICATION=”1″
  • ENABLE_USER_DB=”1″
  • NX_LOG_LEVEL=5
  • NX_LOG_SECURE=1
  • NX_LOGFILE=/var/log/nxserver.log
  • SESSION_LOG_CLEAN=0

Refer to the comments right above the variable for their meaning. The last 4 variables only need to be set temorarily for debugging. Note that this is only an example for authentication, other uses might be more appropriate for you-this one just worked.

To allow SU authentication, the nx user must be in group “users”. The easiest way to do so is via yast, but a usermod does the same. I have run into a rather nasty problem initially related to fonts. From the client log file I got this output:

# cat
/home/martin/.nx/F-C-xxx-1001-DF87C0E8D1C786FF18EBE57A1F32AE8A/session

NXAGENT - Version 3.4.0

Copyright (C) 2001, 2010 NoMachine.
See http://www.nomachine.com/ for more information.

Info: Agent running with pid '16169'.
Session: Starting session at 'Thu Nov  3 16:17:26 2011'.
Info: Proxy running in server mode with pid '16169'.
Info: Waiting for connection from '127.0.0.1' on port '5001'.
Info: Accepted connection from '127.0.0.1'.
Warning: Connected to remote version 3.5.0 with local version 3.4.0.
Warning: Consider checking http://www.nomachine.com/ for updates.
Info: Connection with remote proxy completed.
Info: Using ADSL link parameters 512/24/1/0.
Info: Using agent parameters 5000/10/50/0/0.
Info: Using cache parameters 4/4096KB/16384KB/16384KB.
Info: Using pack method 'adaptive-7' with session 'unix-gnome'.
Info: Using ZLIB data compression 1/1/32.
Info: Using ZLIB stream compression 4/4.
Info: No suitable cache file found.
Info: Listening to X11 connections on display ':1001'.
Info: Established X client connection.
Info: Using shared memory parameters 1/1/1/4096K.
Info: Using alpha channel in render extension.
Info: Not using local device configuration changes.
Error: Aborting session with 'Could not open default font 'fixed''.
Session: Aborting session at 'Thu Nov  3 16:17:28 2011'.
Session: Session aborted at 'Thu Nov  3 16:17:28 2011'.
Warning: Signals were not blocked in process with pid '16169'.
Info: Watchdog running with pid '16186'.
Info: Waiting the watchdog process to complete.
xrdb: Connection refused
xrdb: Can't open display ':1001'

It appeared to me that the problem was related to the “fixed” font package. I could solve this only by trial and eror, until I found the terminus-font package, installed it and got running. I think I could have installed the nx client on the server as well as it seems to have gone through it’s installation directories but didn’t have to in the end. The SuSE SDB entry for FreeNX recommends the following permission change:

# chown nx /var/lib/nxserver

I can’t comment on the other items in “Setting up things” from http://en.opensuse.org/SDB:FreeNX_server#openSUSE_11.3, I haven’t set up any of these and didn’t run into problems.

Installing and configuring the client

I tried the qtnx client but it repeatedly failed on me so reverted back to the official 3.5 client from the nomachine website. This works as a treat and is available for all major platforms, including deb and rpm packages. The client is installed in /usr/NX/-and this is the reason I went for FreeNX, as it adheres more to the LSB.

One thing to remember is to get the key you generated during the nxerver setup and copy it into your client session-otherwise you will never be able to connect to your host. You need to copy everything in BEGIN … to … END (including these lines) from the key (/var/lib/nxserver/home/.ssh/client.id_dsa.key) into the key field accessible in the general settings once the wizard completed. I don’t want to reproduce the client setup here, a good introduction is found in reference [4].

Before hitting the “connect” button I suggest that you tail -f /var/log/nxserver to troubleshoot the connection. Item [3] in the reference sections answers a number of FAQs around the troubleshooting, and I also found other distribution’s documentation (like ubuntu) very useful. FreeNX really is distribution agnostic which in this case is great.

I ran into a problem where gnome wasn’t installed correctly on my NX server, a call to zypper to install the gnome pattern solved the problem. I was truly amazed with the usability of the software, it was great fun working with it.

Summary

When you have been able to create sessions without any problems, change the log level for NX server to 0 or 1set session_log_clean to 1 again to avoid your hard disk from filling up with log information you are not interested in anyway.

References

  1. http://en.opensuse.org/SDB:FreeNX_server
  2. http://www.susethailand.com/suseforum/index.php?action=printpage;topic=1...
  3. http://openfacts2.berlios.de/wikien/index.php/BerliosProject:FreeNX_-_FAQ
  4. http://www.ews.uiuc.edu/userguide/clrtxt/nx_setup/