Comparison with Teradata

From "Exadata – the Sequel: Exadata V2 is Still Oracle" (http://www.teradata.com/t/assets/0/206/276/5bfc4694-ce82-4a07-867d-3f104...)

Shared Everything vs. Shared Nothing
The word "Nothing" tends to give a negative impression, but that is not the case at all.
Teradata avoids I/O bottlenecks by adopting a Shared Nothing architecture.

Exadata uses ASM to distribute data evenly across the Cells. This is called the SAME architecture (Stripe And Mirror Everywhere, or Everything). Because a Parallel Query reads from the evenly striped Cells, I/O waits occur even between the Parallel Query Slaves (described as Workers in the figure below), and the I/O waits grow in proportion to the number of concurrent sessions.
Whether a query finishes in a few seconds or takes tens of minutes, every Parallel Query Slave scans every Cell.

UKOUG TechEBS 2010 - Conference Thoughts

I wish I could say I planned to break the all-time record for post-conference blogging delay and was waiting until Debra Lilley got round to hers, but the truth is this is yet another post that slipped through the cracks in the constant work/play/travel continuum.

One very unfortunate side-effect of the growth of Twitter that I've noticed over the past year or so is the plummeting number of mid-conference blog posts from both myself and others. I think everyone finds it difficult to blog in the frantic conference atmosphere and Twitter offers a nice easy way out - pumping out short, virtually meaningless messages but, hey, as long as we all feel we're connecting, eh? So I find myself in the sorry position of having to look through old tweets to remind myself what on earth I got up to in Birmingham at the end of November, although some parts will live long in my memory!

This (last?) year was always going to be a slightly different conference experience for me because I wasn't presenting and planned a shorter two-and-a-bit day trip, mainly to catch up with friends and attend a few presentations. It became a lot shorter when my planned Sunday socialising was wiped out by the 14-hour trip, but I suppose I was one of the lucky ones who made it, and in time to have a decent sleep before the first day of presentations.

Monday

My first presentation was Tom Kyte's - What's new in Oracle Database Application Development. Among the many interesting bits and pieces I picked up from this was the reminder of how successfully the modern optimiser can eliminate chunks of workload such as unnecessary outer joins. Most of that type of optimisation depends on safe assumptions the optimiser can make when the correct referential integrity constraints are defined. Another good reason to keep up good physical modelling principles.
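
As a rough illustration (the tables here are mine, not from the presentation), a declared primary key is enough for the optimiser to drop an outer join that contributes no columns to the result, while a foreign key plus NOT NULL does the same job for an inner join:

create table dept (deptno number primary key, dname varchar2(30));
create table emp  (empno  number primary key,
                   ename  varchar2(30),
                   deptno number not null references dept);

-- No DEPT columns are selected and DEPT.DEPTNO is unique, so the
-- optimiser can eliminate the join; the plan shows only a scan of EMP.
select e.empno, e.ename
from   emp e
       left outer join dept d on d.deptno = e.deptno;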

One of the audience questions at the end interested me, asking if Oracle had any plans to be able to synchronise two partition exchanges so that they become part of the same DDL operation. I was interested because I was faced with that specific challenge at the time. Of course, with further thought it became apparent that we didn't really need to perform two partition exchanges at the same time, we just needed to fix our data model!

However, my best laugh came when, at the end of the presentation, he was asked what I've come to think of as Graham Wood's personal question: the great Diagnostics Pack Licensing Question! There are variations on the theme, but it's essentially ... 'Do we really have to pay extra for x, y or z' ... usually the Diagnostics Pack.

Next up was - Sane SAN 2010 - James Morle - although not before I'd drawn far too much attention to myself by walking into a brightly lit room with my ridiculous moustache and side-burns only to see a friendly, inquisitive face to remind me what I had growing on my own! (As the day progressed, I tried to sneak in a little late once the lights were down ;-))

The best thing about James' presentation, apart from the sensible high-level messages about how awful storage performance has been for years, some of the reasons why, and what an utter game-changer Flash is, is that (for the lazy blogger at least) he has a complete paper on the subject to read here. Read it, weep a little, then rejoice that things are improving.

Oracle 'Hidden' Features: Features You Probably Didn't Know About and are FREE - Graham Wood. (Mmmm, well, mostly free, apart from some that require that damned Diagnostics Pack license! ;-) In fairness, many people have paid already and aren't making full use of what they've paid for.)

I particularly enjoyed this and I think it's because it felt like Graham, who does a lot of Oracle keynote presentations these days, was getting back to what he does best - showing off all sorts of useful little technical details that many people would be unaware of, and perhaps only some of us care about ;-), in addition to the big ticket items like ASH and ADDM. He spent a fair bit of time walking through how 11gR2 trace files look, including the proper capture of CLOSE statements, and emphasising that we should use DBMS_MONITOR for tracing (I always do these days), which has parameters like PLAN_STAT that control how often row source execution statistics are dumped to the trace file.
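
For reference, a minimal sketch of the kind of call he was describing; the parameter values are purely illustrative and apply to the current session:

begin
        -- extended SQL trace for the current session, dumping row source
        -- execution statistics after every execution, not just the first
        dbms_monitor.session_trace_enable(
                waits     => true,
                binds     => false,
                plan_stat => 'ALL_EXECUTIONS'
        );
end;
/

-- ... run the statements of interest, then ...

begin
        dbms_monitor.session_trace_disable;
end;
/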

On a related note, for more great information on tracing, see Cary Millsap's latest paper.

My presentation of the conference was Oracle Optimizer - Top Tips to get Optimal SQL Execution all the Time by Maria Colgan, in which she talked about the most common types of problem you see in four areas:

  1. Cardinality
  2. Access path
  3. Join type
  4. Join order

There was so much useful information, I'd suggest you get hold of a copy of the slides if you can. One thing I noted personally was that SQL Monitoring is probably the best method for getting the A-Rows (actual rows) value for parallel queries because 'ALLSTATS LAST' doesn't work well with PQ.
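
To make that concrete, here is roughly the comparison being made; the table name and sql_id are placeholders of mine, and note that the SQL Monitoring report requires the Tuning Pack licence:

-- Serial statement: A-Rows straight from the cursor cache
select /*+ gather_plan_statistics */ count(*) from some_table;
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));

-- Parallel query: the SQL Monitoring report aggregates the work done
-- by all of the PX slaves, which 'ALLSTATS LAST' does not
select dbms_sqltune.report_sql_monitor(sql_id => '&sql_id', type => 'TEXT')
from   dual;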

It made me chuckle away to myself to hear her constantly refer to the optimiser as 'he'. Yes, I thought. A male optimiser. Unreliable, temperamental and sometimes difficult to work with ;-)

Maria was undoubtedly the best speaker I heard and I met many others around the conference who agreed. I'd already seen her present at OOW so I knew she was good (although I've also skipped a couple of her presentations, which she doesn't let me forget ;-)) I was lucky enough to have dinner with her, Peter Robson and Dan Fink later and, despite the fact I was three sheets to the wind (after taking too much Dutch courage to get myself up on stage for the Pantomime), she was excellent company too. At least I think she was, although there might have been times she struggled to get a word in edgeways ;-)

There are plenty of opportunities to hear Maria present at upcoming conferences including Openworld and the Hotsos Symposium. Give them a try and I don't think you'll be disappointed. (What's the betting we end up scheduled to speak at the same time at Hotsos and she wipes out my audience? LOL)

It would be a great shame if the UKOUG don't get Maria and more of the great Oracle Corp speakers from Redwood Shores over. I can do without marketing presentations and I like to see a User Group focus on community presentations, but I know from Openworld that there are great presenters inside Oracle who really know their stuff and it's a nice balance to the user stories.

OK, I think I've sucked up enough now to excuse missing her earlier presentations and being drunk at dinner ;-)

I was able to catch a little bit of Daniel Fink's Customising ASH and AWR: Beyond ashrpt.sql and awrrpt.sql. My main reason for attending was that I know Dan and enjoy listening to him present but I didn't expect to learn too much, given that I teach a course on ASH and AWR. It was nice to see him draw attention to the AWR differences report, though. The more people who know about that the better.
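
For anyone who hasn't come across it, I'm fairly sure the report in question is the AWR Compare Periods report that ships with the database:

-- prompts for two pairs of snapshot ids and reports the differences
SQL> @?/rdbms/admin/awrddrpt.sql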

Which just left me enough time for some pre-shave-off Dutch courage and moral support from friends before the Keynote Presentation by David Callaghan. I turned up late, am not the greatest fan of Oracle keynotes and was petrified by what was to come, so can't remember a thing!

Next I made a fool of myself but there are already enough Panto blogs, photos and videos floating around!

Tuesday

Tuesday was a bit of an anti-climax for me, largely because I had to miss even more presentations because my skin condition was playing up so I had to hunt down some medical supplies and take care of myself. (On a brighter note, the long delay between the conference and this post means that my condition's improved sufficiently that I needn't keep whining about it.)

As one of the slides of Wolfgang Breitling's Time Series Analysis Techniques for Statspack or AWR stated - "The main purpose of time series analysis is identifying patterns in past events which can be used to forecast future values and events." It's an interesting concept and if I had a fraction of the maths that might have helped me understand it, I think I would have enjoyed the presentation much more ;-) But Wolfgang is one of those presenters who always have something to say that interests me and it's probably a topic worth investigating further, perhaps using some of the references to R that he included. Summary - I remember enjoying it but some of it sailed over my head!

Ironically, given the number of times he's recommended taking detailed notes as one of the cornerstones of his personal approach to conferences, I can't remember the first thing about Jonathan Lewis' - Co-operating with the database. It doesn't help that he hasn't uploaded his slides either, but I remember him being utterly compelling to listen to although I doubt that helps much if none of it stuck in my head! I'll put it down to my utterly distracted state that day.

My final presentation of the week was How to Build a System that Doesn't Scale - Graham Wood - a presentation all about the things you *shouldn't* do! It turned out that I was actually quite familiar with a lot of the information from various Real World Performance Group presentations I've attended at Openworld, but Graham walked through a very interesting spreadsheet of benchmark results obtained when misusing various configuration options, which showed, for example, just how unscalable a system becomes when it doesn't manage connections carefully or use bind variables.

Slides for most of the presentations are available from here, although you will need the username and password so it only really works if you attended the conference or maybe someone from your organisation did. If you are particularly interested in any of the presentations, I suppose you could always email the presenter directly.

I finished off my conference with several enjoyable beers at the Tuesday evening networking which was sponsored by OTN and then it was back to work the following morning.

Phew, I am *so* glad I've finally got this post done and can move on to other things. It's been bugging me!

Adding Comments to SQL Statements Improves Performance?

January 15, 2011 While reading the “Pro Oracle SQL” book I learned something interesting.  Commenting your work can improve database performance.  You certainly are aware that thoroughly documenting what you do could prevent hours of headaches that might appear later when trying to investigate problems or add additional features to an existing procedure (I think that was [...]

An Axiomatic Approach to Algebra and Other Aspects of Life

Not many days pass that I don’t think a time or two about James R. Harkey. Mr. Harkey was my high school mathematics teacher. He taught me algebra, geometry, analytic geometry, trigonometry, and calculus. What I learned from Mr. Harkey influences—to this day—how I write, how I teach, how I plead a case, how I troubleshoot, .... These are the skills I’ve used to earn everything I own.

EHCC Mechanics – Proof that whole CUs are not decompressed

I saw an interesting post recently where Greg Rahn talked about HCC mechanics. He claimed that an update to a record stored in HCC format did not require decompressing the whole Compression Unit (CU), which consists of several Oracle blocks. I'm assuming by this he meant that all the records contained in the CU did not get written back to storage in a non-HCC format due to a single record being updated. Greg then showed an example proving row migration occurred for an updated record. He didn't show that the other records had not been decompressed, though. So since I was already working on an HCC chapter for the upcoming Apress Exadata book, I thought I would take time off from the book writing to post this (hopefully the editors will forgive me).

Here’s the recipe: Basically we’ll update a single row, see that its rowid has changed, verify that we can still get to the record via its original rowid, and check to see if the TABLE FETCH CONTINUED ROW statistic gets updated when we access the row via its original rowid, thus proving basic row migration (this is what Greg has already shown). Then we’ll look at block dumps for the original and new block to see what’s there.

-bash-3.2$ sqlplus / as sysdba
 
SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 14 14:16:20 2011
 
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
 
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
 
SYS@SANDBOX1> select rowid, old_rowid(rowid) location from kso.skew_hcc3 where pk_col=16367;
 
ROWID              LOCATION
------------------ --------------------
AAATCBAAIAAF8uSFc9 8.1559442.22333
 
SYS@SANDBOX1> -- so my row is in file 8, block 1559442, slot 22333
SYS@SANDBOX1> 
SYS@SANDBOX1> update kso.skew_hcc3 set col1=col1 where pk_col=16367;
 
1 row updated.
 
SYS@SANDBOX1> select rowid, old_rowid(rowid) location from kso.skew_hcc3 where pk_col=16367;
 
ROWID              LOCATION
------------------ --------------------
AAATCBAAHAAMGMMAAA 7.3171084.0
 
SYS@SANDBOX1> -- Ha! The rowid has changed – the row moved to file 7, block 3171084, slot 0
SYS@SANDBOX1> 
SYS@SANDBOX1> -- Let's see if we can still get to it via the original rowid
SYS@SANDBOX1> 
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAIAAF8uSFc9';
 
    PK_COL
----------
     16367
 
SYS@SANDBOX1> -- Yes we can! – can we use the new rowid?
SYS@SANDBOX1> 
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAHAAMGMMAAA';
 
    PK_COL
----------
     16367
 
SYS@SANDBOX1> -- That works too! – It’s a migrated Row!
SYS@SANDBOX1> -- Let’s verify with “continued row” stat
SYS@SANDBOX1> 
SYS@SANDBOX1> @mystats
Enter value for name: table fetch continued row
 
NAME                                                                             VALUE
---------------------------------------------------------------------- ---------------
table fetch continued row                                                         2947
 
SYS@SANDBOX1> -- select via the original rowid
SYS@SANDBOX1> 
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAIAAF8uSFc9';
 
    PK_COL
----------
     16367
 
SYS@SANDBOX1> @mystats
Enter value for name: table fetch continued row
 
NAME                                                                             VALUE
---------------------------------------------------------------------- ---------------
table fetch continued row                                                         2948
 
SYS@SANDBOX1> -- Stat is incremented – so definitely a migrated row!

So the row has definitely been migrated. Now let’s verify that the migrated row is not compressed. We can do this by dumping the block where the newly migrated record resides.

SYS@SANDBOX1> !cat dump_block.sql
@find_trace
alter system dump datafile &fileno block &blockno;
 
SYS@SANDBOX1> @dump_block
 
TRACEFILE_NAME
------------------------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/sandbox/SANDBOX1/trace/SANDBOX1_ora_5191.trc
 
Enter value for fileno: 7
Enter value for blockno: 3171084
 
System altered.

Now let’s look at the trace file produced in the trace directory. Here is an excerpt from the block dump.

Block header dump:  0x01f0630c
 Object id on Block? Y
 seg/obj: 0x13081  csc: 0x01.1e0574d4  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1f06300 ver: 0x01 opc: 0
     inc: 0  exflg: 0
 
 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x002f.013.00000004  0x00eec383.01f2.44  ----    1  fsc 0x0000.00000000
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x01f0630c
data_block_dump,data header at 0x2b849c81307c
===============
tsiz: 0x1f80
hsiz: 0x14
pbl: 0x2b849c81307c
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f60
avsp=0x1f4c
tosp=0x1f4c
0xe:pti[0]      nrow=1  offs=0
0x12:pri[0]     offs=0x1f60
block_row_dump:
tab 0, row 0, @0x1f60
tl: 32 fb: --H-FL-- lb: 0x1  cc: 5
col  0: [ 4]  c3 02 40 44
col  1: [ 2]  c1 02
col  2: [10]  61 73 64 64 73 61 64 61 73 64
col  3: [ 7]  78 6a 07 15 15 0b 32
col  4: [ 1]  59
end_of_block_dump

Notice that there is only one row in the block (nrow=1). Also notice that the object_id is included in the block (in hex format). It is labeled “seg/obj:”. The table has 5 columns. The values are displayed – also in hex format. Just to verify that we have the right block and row we can translate the object_id and the value of the first column as follows:

SYS@SANDBOX1> !cat obj_by_hex.sql
col object_name for a30
select owner, object_name, object_type
from dba_objects
where object_id = to_number(replace('&hex_value','0x',''),'XXXXXX');
 
SYS@SANDBOX1> @obj_by_hex
Enter value for hex_value: 0x13081
 
OWNER                          OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------------------ -------------------
KSO                            SKEW_HCC3                      TABLE
 
 
SYS@SANDBOX1> desc kso.skew_hcc3
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 PK_COL                                 NUMBER
 COL1                                   NUMBER
 COL2                                   VARCHAR2(30)
 COL3                                   DATE
 COL4                                   VARCHAR2(1)
 
SYS@SANDBOX1> !cat display_raw.sql
col value for a50
select display_raw(replace('&string',' ',''),nvl('&TYPE','VARCHAR2')) Value from dual
/
 
SYS@SANDBOX1> @display_raw
Enter value for string: c3 02 40 44
Enter value for type: NUMBER
 
VALUE
--------------------------------------------------
16367

As you can see, this is the record that we updated earlier in the SKEW_HCC3 table. Note: display_raw.sql depends on a little function called display_raw() which coincidentally I got from Greg Rahn. Here’s a script to create the function: create_display_raw.sql

Now for a quick look back at the original block (note that in my testing I updated another row in this same block):

===============
tab 0, row 1, @0x32
tl: 5086 fb: --H-F--N lb: 0x3  cc: 1
nrid:  0x0217cb93.0
col  0: [5074]
Compression level: 03 (Archive Low)
 Length of CU row: 5074
kdzhrh: ------PC CBLK: 1 Start Slot: 00
 NUMP: 01
 PNUM: 00 POFF: 5054 PRID: 0x0217cb93.0
CU header:
CU version: 0   CU magic number: 0x4b445a30
CU checksum: 0x982dec03
CU total length: 11403
CU flags: NC-U-CRD-OP
ncols: 5
nrows: 32759
algo: 0
CU decomp length: 7266   len/value length: 945436
row pieces per row: 1
num deleted rows: 2
deleted rows: 22333, 30848,
START_CU:
 00 00 13 d2 1f 01 00 00 00 01 00 00 13 be 02 17 cb 93 00 00 00 4b 44 5a 30
 03 ec 2d 98 00 00 2c 8b eb 06 00 05 7f f7 00 0e 6d 1c 01 00 02 00 00 00 00

So this little excerpt shows that this is an HCC compressed block (Compression level: 03 (Archive Low), plus the many CU references). The nrows line shows us that the block contains 32759 rows. It also shows that 2 rows have been deleted from the block (num deleted rows). Notice that one of the deleted rows is the one in slot 22333 (sound familiar?). If you look back at the original rowid in the old format (fileno.blockno.slot) you’ll see that it is the row we updated. It was “deleted” from this block when it was migrated to the new block. Of course there is still a pointer left behind.

SYS@SANDBOX1> select old_rowid('AAATCBAAIAAF8uSFc9') location from dual;
 
LOCATION
-----------------------------------------
8.1559442.22333
 
 
SYS@SANDBOX1> select old_rowid(rowid) location, a.* from kso.skew_hcc3 a where rowid = 'AAATCBAAIAAF8uSFc9';
 
LOCATION                 PK_COL       COL1 COL2                           COL3      C
-------------------- ---------- ---------- ------------------------------ --------- -
7.3171084.0               16367          1 asddsadasd                     21-JUL-06 Y

It’s hard to prove a negative, but it does not appear that any records are decompressed other than those that are actually updated. The other rows in the block appear to remain in HCC format.

Clustered ASM and RAC password file maintenance

A recurring question during Grid Infrastructure and RAC courses I teach is “How do you manage Oracle password files in a clustered environment?”. The answer isn’t as straightforward as you might think because there are significant differences between ASM and RAC (==clustered database) environments. Additionally, in recent releases changes were made concerning password file […]


Statspack on RAC

Some time ago I was on a client site which was busy setting up a RAC cluster using Oracle 10g. Although I wasn’t involved with that particular system there were a couple of coffee-breaks where I ended up chatting with the consultant that the client had hired to install the cluster. In one of our breaks he surprised me by making the casual comment: “Of course, you can’t run Statspack on RAC because it locks up the system.”

Now there’s no reason why Statspack should lock up a RAC cluster – in principle. But this was an 8-node cluster and if you set up the automatic job to take snapshots by running the default spauto.sql script all eight nodes could start running every hour on the hour – so they would all be fighting constantly for every block of every table and index in the Statspack schema.  (I’m exaggerating for effect, of course, but not by much). You might not notice the global cache contention in a 2-node cluster but eight nodes could, indeed, “lock up the system” at this point.

Ideally, of course, you would like to run all eight snapshots simultaneously so that you get information from the same moment across all nodes. In practice you have to stagger the snapshots.

The code for taking snapshots for the AWR (automatic workload repository) gets around this problem by using the ‘WF’ lock for cross-instance synchronisation. One instance gets the lock exclusively, and that’s the instance that sets the snapshot time and coordinates the flushing to disc for all the other instances. (The other instances collect their stats at the same moment, but are cued to flush them to disc one after the other.)

For Statspack you have to do something manually – and the simplest approach that people sometimes take is to run a slightly different version of spauto.sql on each node so that each node is scheduled to start its snapshot a couple of minutes after the previous one. But there is an alternative that is a little smarter and eliminates the risk of two instances overlapping due to time slippage or a slow snapshot.

Taking the idea from AWR we can make use of dbms_lock to ensure that each snapshot starts as soon as possible after the previous one. We simply create a wrapper procedure for Statspack that tries to take a specified user-defined lock in exclusive mode before calling the snapshot procedure, and releases the lock after the snapshot is complete. We can schedule this procedure to run on every node at the start of each hour – which means every node on the system will try to take the lock simultaneously – but only one node will get it and the rest will queue. After a node completes its snapshot and releases the lock, the next node in the queue immediately acquires the lock and starts its snapshot.

Here’s some sample code for the wrapper. I’ve included a couple of debug messages, and a piece of code that makes the procedure time out without taking a snapshot if it has to wait for more than 10 minutes. Note that you have to grant execute on dbms_lock to perfstat for this to work.

create or replace procedure rac_statspack (i_snap_level in number) as
        m_status        number(38);
        m_handle        varchar2(60);
begin

        sys.dbms_lock.allocate_unique(
                lockname        => 'Synchronize Statspack',
                lockhandle      => m_handle
        );

        m_status := sys.dbms_lock.request(
                lockhandle              => m_handle,
                lockmode                => dbms_lock.x_mode,
                timeout                 => 600,         -- default is dbms_lock.maxwait
                release_on_commit       => false        -- which is the default
        );

        if (m_status = 0 ) then
                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        ': Acquired lock, running statspack'
                );

                statspack.snap(i_snap_level);

                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        ': Snapshot completed'
                );

                m_status := sys.dbms_lock.release(
                        lockhandle      => m_handle
                );
        else
                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        case m_status
                                when 1 then ': Lock wait timed out'
                                when 2 then ': deadlock detected'
                                when 3 then ': parameter error'
                                when 4 then ': already holding lock'
                                when 5 then ': illegal lock handle'
                                       else ': unknown error'
                        end
                );
        end if;

end;
/

And here’s a little bit of pl/sql you can run on each node in turn to install the procedure under dbms_job.

declare
	m_job	number;
	m_inst	number;
	m_date	date;
	m_jqs	number;

begin
	select	instance_number
	into	m_inst
	from	v$instance;

	dbms_job.submit(
		job		=> m_job,
		what		=> 'rac_statspack(7);',
		next_date	=> trunc(sysdate + 1 / 24,'HH'),
		interval	=> 'trunc(SYSDATE + 1 / 24,''HH'')',
		no_parse	=> TRUE,
		instance	=> m_inst,
		force		=> true
	);
	commit;

	select
		next_date
	into	m_date
	from	dba_jobs
	where	job = m_job
	;

	select
		value
	into	m_jqs
	from	v$parameter
	where	name = 'job_queue_processes'
	;

	dbms_output.put_line('Job number: ' || m_job);
	dbms_output.put_line('Next run time: ' || to_char(m_date,'dd-Mon-yyyy hh24:mi:ss'));
	dbms_output.put_line('Current Job Queues: ' || m_jqs || ' (must be greater than zero)');

end;
/

Warning: Judging by the date stamps on my files it’s at least 18 months since I last took this approach with a system – so (a) you might want to test it carefully before you use it and (b) you might want to modify the code to use dbms_scheduler to run the job rather than dbms_job.
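
For what it's worth, here's a rough, untested sketch of how the dbms_scheduler version might look, run once on each node in turn; the instance_id job attribute it relies on assumes 11.2:

declare
        m_inst  varchar2(4);
begin
        m_inst := sys_context('userenv','instance');

        -- one scheduler job per instance, pinned to the instance it was created on
        dbms_scheduler.create_job(
                job_name        => 'RAC_STATSPACK_' || m_inst,
                job_type        => 'PLSQL_BLOCK',
                job_action      => 'begin rac_statspack(7); end;',
                start_date      => trunc(sysdate + 1/24,'HH'),
                repeat_interval => 'FREQ=HOURLY;BYMINUTE=0',
                enabled         => false
        );

        dbms_scheduler.set_attribute(
                name      => 'RAC_STATSPACK_' || m_inst,
                attribute => 'instance_id',
                value     => to_number(m_inst)
        );

        dbms_scheduler.enable('RAC_STATSPACK_' || m_inst);
end;
/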

[Further Reading on Statspack]

Adding storage dynamically to ASM on Linux

Note: This discussion is potentially relevant only to OEL 5.x and RHEL 5.x – I haven’t been able to verify that it works the same way on other Linux distributions, though I would assume it does. Before starting with the article, here are some facts:

  • OEL/RHEL 5.5 64bit
  • Oracle 11.2.0.2
  • native multipathing: device-mapper-multipath

The question I have asked myself many times is: how can I dynamically add a LUN to ASM without having to stop any component of the stack? Having mocked “reboot-me” OSs like Windows, I soon went quiet when it came to discussing the addition of a LUN to ASM on Linux. Today I learned how to do this, by piecing together information I got from Angus Thomas, a great Red Hat system administrator I had the pleasure to work with in 2009 and 2010. And since I have a short-lived memory I decided to write it down.

I’ll describe the process from top to bottom, from the addition of the LUN to the server all the way up to the addition of the ASM disk to the disk group.

Adding the storage to the cluster nodes

The first step is obviously to get the LUN assigned to the server(s). This is the easy part, and outside the control of the Linux/Oracle admin. The storage team will provision a new LUN to the hosts in question. At this stage, Linux has no idea about the new storage: to make it available, the system administrator has to rescan the SCSI bus. A proven and tested way in RHEL 5 is to issue this command:

[root@node1 ~]# for i in `ls -1 /sys/class/scsi_host`; do
> echo "- - -" > /sys/class/scsi_host/${i}/scan
> done

The new, unpartitioned LUN will appear in /proc/partitions. If it doesn’t, then there’s probably something wrong on the SAN side; check /var/log/messages and talk to your storage administrator. If it’s not a misconfiguration then you may have no option but to reboot the node.

Configure Multipathing

So far so good; the next step is to add it to the multipathing. First of all, you need to find out what the new WWID of the device is. In my case that’s simple: the last new line in /proc/partitions is usually a giveaway. If you are unsure, ask the storage administrator to check the WWID from a console on the array. It’s important to get this right at this stage :)

To add the new disk to the multipath.conf file, all you need to do is to add a new section, as in the following example:


multipaths {
  ..
  multipath {
    wwid                 360000970000294900664533030344239
    alias                ACFS0001
    path_grouping_policy failover
  }
  ..
}

By the way, I have written about configuring multipathing in more detail in a previous blog post here. Don’t forget to replicate the changes to the other cluster nodes!

Now you reload multipathd using /etc/init.d/multipathd reload on each node and, voilà, you should see the device in /dev/mapper/ – my ACFS disk appeared as /dev/mapper/ACFS0001.

Now the tricky bit is to partition it (if you need to; partitioning is no longer mandatory with 11.1 and newer, though some software, like EMC’s Replication Manager, still requires it). I succeeded in doing so by checking the device in /dev/disk/by-id and then using fdisk against it as in this example:

...
# fdisk /dev/disk/by-id/scsi-360000970000294900664533030344239
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 23251.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
 (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
 e   extended
 p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (32-47619839, default 32): 128
Last sector or +size or +sizeM or +sizeK (129-47619839, default 47619839):
Using default value 47619839

Command (m for help): p

Disk /dev/disk/by-id/scsi-360000970000294900664533030344239: 24.3 GB, 24381358080 bytes
64 heads, 32 sectors/track, 23251 cylinders, total 47619840 sectors
Units = sectors of 1 * 512 = 512 bytes

 Device Boot                                                         Start  End         Blocks     Id  System
/dev/disk/by-id/scsi-360000970000294900664533030344239p1             128    47619839    23809855+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

Once you are in fdisk, the commands are identical to single-pathed storage. Type “n” to create a new partition, “p” for a primary and specify the start and end cylinders as needed. Type “p” to print the partition table, and if you are happy with it use “w” to write it. You might wonder why I added an offset and changed the unit (“u”); this is due to the EMC storage this site uses. The EMC® Host Connectivity Guide for Linux (P/N 300-003-865 REV A23) suggests a 64k offset. Don’t simply repeat this in your environment; check with the storage team first.

Before adding the partitions to ACFS0001 and ACFS0002 I had 107 partitions:


[root@node1 ~]# wc -l /proc/partitions
107 /proc/partitions

Check /proc/partitions again after the two fdisk commands have completed:


[root@node1 ~]# wc -l /proc/partitions
 107 /proc/partitions

But when you check /dev/mapper now you still don’t see the partitions; the naming convention is to append pN to the device name, i.e. /dev/mapper/ACFS0001p1 for the first partition, and so on.

kpartx to the rescue! This superb utility can read the partition table of a device and create the corresponding device-mapper entries. Initially my setup was as follows:


[root@node1 ~]# ls -l /dev/mapper/ACFS*
brw-rw---- 1 root disk 253, 31 Jan 18 10:05 /dev/mapper/ACFS0001
brw-rw---- 1 root disk 253, 32 Jan 18 10:05 /dev/mapper/ACFS0002

Usually I would have rebooted the node at this stage, as I didn’t know how to update the partition table otherwise. But with kpartx (“yum install kpartx” to install) this is no longer needed. Consider the example below:

[root@node1 ~]# kpartx -l /dev/mapper/ACFS0001
ACFS0001p1 : 0 47619711 /dev/mapper/ACFS0001 129
[root@node1 ~]# kpartx -a /dev/mapper/ACFS0001
[root@node1 ~]# kpartx -l /dev/mapper/ACFS0002
ACFS0002p1 : 0 47619711 /dev/mapper/ACFS0002 129
[root@node1 ~]# kpartx -a /dev/mapper/ACFS0002

[root@node1 ~]# ls -l /dev/mapper/ACFS000*
brw-rw---- 1 root disk 253, 31 Jan 18 10:05 /dev/mapper/ACFS0001
brw-rw---- 1 root disk 253, 36 Jan 18 10:13 /dev/mapper/ACFS0001p1
brw-rw---- 1 root disk 253, 32 Jan 18 10:05 /dev/mapper/ACFS0002
brw-rw---- 1 root disk 253, 37 Jan 18 10:13 /dev/mapper/ACFS0002p1

“kpartx -l” prints the partition table, and “kpartx -a” adds the partition mappings, as the example shows. No more need to reboot! However, as has been pointed out in the comments section (see below), kpartx doesn’t add both paths, so you should run the partprobe command to add the missing ones:


[root@node1 ~]# partprobe
[root@node1 ~]# wc -l /proc/partitions
109 /proc/partitions

See how there are 109 partitions listed now instead of just 107 from before; the 2 missing paths have been added (one for each device).

Add disks to ASM

With this done, you can add the disk to ASM – I personally like the intermediate step of creating an ASMLib disk. Connect to ASM as SYSASM and add the disk using the alter diskgroup command:

SQL> alter diskgroup ACFSDG add disk 'ORCL:ACFS0002', 'ORCL:ACFS0001';

Now just wait for the rebalance operation to complete.
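
Or, if you'd rather watch it than wait, v$asm_operation shows the progress and estimated minutes remaining (a convenience query of mine, not part of the original steps):

SQL> select group_number, operation, state, power, sofar, est_work, est_minutes from v$asm_operation;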

New Indexing Seminars Scheduled For Vienna and Tallinn (New Europeans)

I have two Oracle Index Internals and Best Practices seminars scheduled in the coming months, in Austria (Vienna) and Estonia (Tallinn). These could very well be my last seminars for quite a while as I’m unsure whether I’ll be able to travel again this year, so this could be your last opportunity to see me [...]