Search

Top 60 Oracle Blogs

Recent comments

Oakies Blog Aggregator

The other day

The other day, we had a serious issue in the ASM diskgroups - one diskgroup refused to come up because one disk was missing; but it was not clear from the message which of the 283 devices was missing. This underscores the difficulty in diagnosing ASM discovery related issues. In this post, I have tried to present a way to diagnose this sort of issues through a real example.

We had planned to move from one storage device to another (EMC DMX-3 to DMX-4) using SRDF/A technology. The new storage was attached to a new server. The idea was to replicate data at the storage level using SRDF/A. At the end of the replication process, we shut the database and ASM down and brought up the ASM instance in the newer storage on the new server. Since the copy was disk level, the disk signatures were intact and the ASM disks retained their identity from the older storage. So, when the ASM instance was started, it recognized all the disks and mounted all the diskgroups (10 of them) except one.

While bringing up a disk group called APP_DG3 on the new server it complained that disk number “1” is missing; but it was not clear which particular disk was. In this blog the situation was diagnosed and performed.

Note: the asm disk paths changed on the storage. This was not really a problem; since we could simply define a new asm_diskstring. Remember: the diskstring initialization parameter simply tells the ASM instance which disks should be looked at while discovering. Once those disks are identified, ASM looks at its signature on the disk headers to check the properties - the disk number, the diskgroup it belongs to, the capacity, version compatibilty and so on. So as long as the correct asm_diskstring init parameter is provided, ASM can readily discover the disks and get the correct names.

Diagnosis

This issue arises when ASM does not find all the disks required for that disk group. There could be several problems:

(i) the disk itself is physically not present
(ii) it’s present but not allowed to be read/write at the SAN level
(iii) it’s present but permissions not present in the OS
(iv) it’s present but the disk is not mapped properly; so the disk header shows something else. ASM knows the disk number, group, etc. from the disk header. If the disk header is not readable; or is not an ASM disk, the header will not reveal anything to ASM and hence will not mount.

If an ASM diskgroup is not mounted, the group_number for that disk shows “0”. If it’s mounted, the group number shows whatever the group number of the disk group is. Please note: the disk numbers are dynamic. So, APP_DG1 may have a group number “1” but the number may change to “2” after the next recycle.

Since the issue involved APP_DG3, I checked the group number for the group APP_DG3 from the production ASM (the old SAN on the old server) by issuing the query:

ASM> select group_number, name
2 from v$asm_diskgroup
3 where name = 'APP_DG3'
4 /

GROUP_NUMBER NAME
------------ ------------------------------
4 ASM_DG3

This shows the group number is 4 for the APP_DG3 group. I will use this information later during the analsysis.

On the current production server, I checked the devices and disk number of group number 4:

ASM> select disk_number, path
2 from v$asm_disk
3 where group_number = 4;

DISK_NUMBER PATH
----------- --------------------
0 /dev/rdsk/c83t7d0
1 /dev/rdsk/c83t13d5
2 /dev/rdsk/c83t7d2
… and so on …
53 /dev/rdsk/c83t7d1

54 rows selected.

On the new server, I listed out the disks not mounted by the disk groups. Knowing that disks belonging to an unmounted diskgroup show a group number 0, the following query pulls the information:

ASM> select disk_number, path
2 from v$asm_disk
3 where group_number = 0;

DISK_NUMBER PATH
----------- --------------------
0 /dev/rdsk/c110t1d1
2 /dev/rdsk/c110t1d2

3 /dev/rdsk/c110t1d3
… and so on …
254 /dev/rdsk/c110t6d7

54 rows selected.

Carefully study the output. The results did not show anything for disk number “1”. The disks were numbered 0 followed by 2, 3 and so on. The final disk was numbered “254”, instead of 54. So, the disk number “1” was not discovered by ASM.

From the output we know that production disk /dev/rdsk/c83t7d0 mapped to new server disk /dev/rdsk/c110t1d1, since they have the same disk# (“0”). For disk# 2, production disk /dev/rdsk/c83t7d2 is mapped to /dev/rdsk/c110t1d2 and so on. However, production disk /dev/rdsk/c83t13d5 is not mapped to anything on the new server, since there is no disk #1 on the new server.

Next I asked the Storage Admin what he mapped for disk /dev/rdsk/c83t13d5 from production. He mentioned a disk called c110t6d25.

I checked in the new server, if that disk is even visible:

ASM> select path
2 from v$asm_disk
3 where path = '/dev/rdsk/c110t6d25'
4 /

no rows selected

It confirmed my suspicion – ASM can’t even read the disk. Again, the reasons could any of the above mentioned ones - disk is not presented, does not have correct permission, etc.

In this case the physical disk was actually present and was owned by “oracle”; but not accessible to ASM. The issue was with SRDF not making the disk read/write. It was still in sync mode, preventing the disk to be enabled for writing. ASM couldn't open the disk in read write mode; so it rejected it as a member of any diskgroup and assigned it a default disk number 254.

After Storage Admin fixed the issue by making the disk read write, I re-issued the discovery:

ASM> select path
2 from v$asm_disk
3 where path = '/dev/rdsk/c110t6d25'
4 /

PATH
-------
/dev/rdsk/c110t6d25

It returned with a value. Now ASM can read it correctly. I mounted the disk:

ASM> alter diskgroup APP_DG3 mount;

It mounted successfully; because it got all the disks to make up the complete group.

After that the disk# 254 also went away. Now the disks showed 0, 1, 2, 3, … 53 for the group on both prod and the new server.

Mystats utility

A variation on Jonathan Lewis's SNAP_MY_STATS package to report the resource consumption of a unit of work between two snapshots. Designed to work under constrained developer environments, this version has enhancements such as time model statistics and the option to report on specific statistics. ***Update*** Now available in two formats: 1) as a PL/SQL package and 2) as a free-standing SQL*Plus script (i.e. no installation/database objects needed). June 2007 (updated October 2011)

Runstats utility

A variation on Tom Kyte's invaluable RUNSTATS utility that compares the resource consumption of two alternative units of work. Designed to work under constrained developer environments and builds on the original with enhancements such as "pause and resume" functionality, time model statistics and the option to report on specific statistics. ***Update*** Now available in two formats: 1) as a PL/SQL package and 2) as a free-standing SQL*Plus script (i.e. no installation/database objects needed). January 2007 (updated October 2011)

Row Migration can Aggravate Contention on Cache Buffers Chains Latch

AWR Wait EventsOne of my customers has an Oracle based system with a large number of concurrent users. For time to time, the users would report that the system would 'grind to a halt'. Examination of AWR data showed lots of time spent waiting on latch: cache buffers chains (I discussed how to graph AWR data in Excel in a previous blog entry, which is how I produced this graph of database wait events).

They system had been able to go for weeks without an incident. More recently, as more users have been migrated onto the system, we would a series of days with spikes on this wait event, though they were normally quite short-lived. The incident shown in this chart was unusual in that it lasted most of a day. The chart also shows that a reasonable amount of time was lost on db file sequential read, this is consistent with blocks being loaded into the buffer cache, requiring access to the cache buffer chains, and hence requiring access to the latch that protects these chain.

Concurrent Database SessionsUsing a similar technique I was graph the number of database sessions over time, and I found a similar series of spikes.

By eye I could see that the spikes in the latch contention seemed to correspond to the spikes in the number of concurrent user sessions, and they had roughly the same shape.

I then graphed time waited for this latch against the number of concurrent user sessions and I got a strong, and surprisingly linear, correlation.Cache Buffers Chains Latch Wait -v- Concurrent Sessions

It is very clear that whenever we had more than about 90 concurrent user sessions, the system also lost time waiting on the cache buffers chains latch. Basically, this application won't scale any further!

So, why was the latch being held for such a long time? The buffer cache is 1Gb (with a block size of 8Kb), so it is large, but not excessively so. At this point somebody (and I can't claim the credit for this) thought to check for migrated rows on the tables involved in the longest running SQL statements during the periods of high latch contention, and discovered that a number of the most heavily use tables had a significant quantity of migrated rows.

A row migrates when the data is updated and there is no free space left in the data block to store the new data values. Oracle puts the row into a new block, but it does not update the indexes to point to the new block. Instead, it puts a forwarding pointer into the original block. Therefore, to read the data from a migrated row requires visits to two blocks. When retrieving data into the buffer cache your process need to update two buffer blocks, and must acquire the buffer chains latch twice. Row migration causes poor performance because Oracle must do twice as much work.

There are two factors that come together to cause row migration:

  • A process inserts a row, and then subsequently updates columns on that row, often from null to not null values, but sometimes with just more data. In my case, the application was storing XML structures in a clob (and most of the clobs were small enough to be stored in-line with the rest of the row).
  • There is insufficient free space left in the data blocks to allow for the subsequent updates to be stored in the original data block.

In this particular case, I was dealing with a third-party packaged application. So there was no possibility to change the way the application inserts and updats the data. However, we could and did rebuild the object to eliminate migration of existing rows and set a realistic PCTFREE to reserve free space for future rows.

AWR Wait, magnifiedThis chart shows the same data as the first one, except that I have changed the scale on the y-axis. The tables with the worst row migration were reorganised on the Tuesday night and the system ran on Wednesday under a normal load without any problem.

Contention on the buffer cache chains latch fell to trivial levels, just 408 seconds in 24 hours. Interestingly, the amount of db file sequential read also fell by 70%. This is due to better cache efficiency. Since eliminating much of the row migration, fewer blocks need to be loaded into cache, and so blocks stay in the buffer cache for longer.

This is a real-life example of:

  • why it is important, sometimes critically so, to set physical attributes on tables properly;
  • why it is necessary to understand how your application is updating the database;
  • and what are the potential implications of not doing so!

communication, clarity and the benefits of real names

--- just found this one. I wrote it in Jan 09 and didn't publish it: I think I planned on adding a lot of links and ran out of time. Publishing it under the original date, but if you've been here before, you didn't miss it the first time. I slipped it in ... oh no! I've cheated on the version control system !! :)For the past month, my work has been somewhat disconnected from the database.

Migrate from Windows XP 64bit to Ubuntu Intrepid Ibex 8.10 64bit

I’ve been using the newest and the greatest version of Ubuntu (Intrepid 8.10) for almost a month now and I’m happy with it </p />
</p></div>

    	  	<div class=

As Doug hinted

As Doug hinted in a recent comment, I have been invited to join the Oak Table and I have enthusiastically accepted. It's quite an honor and also, somewhat ironic.At different times in my career, I have found myself in employment discussions with Oracle. However, like star crossed lovers, when one of us was interested, the other was involved elsewhere. At some point, I decided it just wasn't

God rest ye merry gentlemen ...

Unbelievable ... it's already been two weeks since I returned from UKOUG. I enjoyed this year's conference even more than last year, in part because I paced myself better. I was reunited with many wonderful friends and made some new ones too. Add to that a week of top notch presentations, some excellent dinners with good company and it was absolutely perfect.So here are a few of the

Teaching a horse not to eat

My friend Ole told me this story many moons ago - and many times since:

A man decided to teach his horse not to eat anymore. Gradually he reduced the amount of food the horse got each day, and the programme worked really well.

Unfortunately, just as he had finally taught the horse not to eat at all something unforeseen and tragic happened to the horse and it died suddenly.

I was reminded of the story the other day when I was studying this article in my beloved The Economist (I've subscribed non-stop since 1983):

Stopping in a hurry

Dec 11th 2008
From The Economist print edition

Cars are getting better at avoiding collisions. Before long they may be communicating with each other to make roads safer

.... and somewhere in the article this is stated:

"Jan Ivarsson, head of safety at Volvo, believes it should be possible to build a car in which people will not be killed or injured."

On the other hand I can read in various newspapers that Volvo is not doing too well, and may in fact soon be either sold or closed, just like Saab. Or maybe Sweden will try to put those two together and create a small (by international standards) entity that might survive with state funding and what have you.

So you have this carmaker - Volvo - who has been making cars safer and safer and safer over the last several decades, and JUST as they're sensing the possibility of making the perfectly safe car - in which people will not get killed - the carmaker Volvo unfortunately died. Like the horse.

In my own, little world I have also been witnessing how perfect the databases are getting, how much they can do, how much stuff you can put into them in order to save on the application coding and development side - and how coders, developers and programmers have stopped using them. Just as databases were getting damn near perfect ... people stopped using them.

I have for several years now claimed that any computer technology that reached a state of perfection, a plateau of predictability & stability and a high level of usefulness ... will be replaced with something more chaotic and hence much less productive. I have seen no exceptions.

I now realise it is connected: Technology reaching maturty, car safety reaching its logical conclusion - and feeding of horses.

Introduction to advanced queuing

A high-level tutorial on Oracle's Advanced Queuing. July 2005