January 15, 2011 While reading the “Pro Oracle SQL” book I learned something interesting. Commenting your work can improve database performance. You certainly are aware that thoroughly documenting what you do could prevent hours of headaches that might appear later when trying to investigate problems or add additional features to an existing procedure (I think that was [...]
Not many days pass that I don’t think a time or two about James R. Harkey. Mr. Harkey was my high school mathematics teacher. He taught me algebra, geometry, analytic geometry, trigonometry, and calculus. What I learned from Mr. Harkey influences—to this day—how I write, how I teach, how I plead a case, how I troubleshoot, .... These are the skills I’ve used to earn everything I own.
I saw an interesting post recently where Greg Rahn talked about HCC mechanics. He claimed that an update to a record stored in HCC format does not require decompressing the whole Compression Unit (CU), which consists of several Oracle blocks. I'm assuming he meant that the records contained in the CU do not all get written back to storage in non-HCC format just because a single record was updated. Greg then showed an example proving that row migration occurred for an updated record. He didn't show that the other records had not been decompressed, though. So, since I was already working on an HCC chapter for the upcoming Apress Exadata book, I thought I would take a little time off from the book writing to post this (hopefully the editors will forgive me).
Here's the recipe: basically we'll update a single row, see that its rowid has changed, verify that we can still get to the record via its original rowid, and check whether the TABLE FETCH CONTINUED ROW statistic gets updated when we access the row via its original rowid, thus proving basic row migration (this is what Greg has already shown). Then we'll look at block dumps of the original and new blocks to see what's there.
-bash-3.2$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 14 14:16:20 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SYS@SANDBOX1> select rowid, old_rowid(rowid) location from kso.skew_hcc3 where pk_col=16367;

ROWID              LOCATION
------------------ --------------------
AAATCBAAIAAF8uSFc9 8.1559442.22333

SYS@SANDBOX1> -- so my row is in file 8, block 1559442, slot 22333
SYS@SANDBOX1>
SYS@SANDBOX1> update kso.skew_hcc3 set col1=col1 where pk_col=16367;

1 row updated.

SYS@SANDBOX1> select rowid, old_rowid(rowid) location from kso.skew_hcc3 where pk_col=16367;

ROWID              LOCATION
------------------ --------------------
AAATCBAAHAAMGMMAAA 7.3171084.0

SYS@SANDBOX1> -- Ha! The rowid has changed – the row moved to file 7, block 3171084, slot 0
SYS@SANDBOX1>
SYS@SANDBOX1> -- Let's see if we can still get to it via the original rowid
SYS@SANDBOX1>
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAIAAF8uSFc9';

    PK_COL
----------
     16367

SYS@SANDBOX1> -- Yes we can! – can we use the new rowid?
SYS@SANDBOX1>
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAHAAMGMMAAA';

    PK_COL
----------
     16367

SYS@SANDBOX1> -- That works too! – It's a migrated row!
SYS@SANDBOX1> -- Let's verify with the "continued row" stat
SYS@SANDBOX1>
SYS@SANDBOX1> @mystats
Enter value for name: table fetch continued row

NAME                                                                             VALUE
---------------------------------------------------------------------- ---------------
table fetch continued row                                                          2947

SYS@SANDBOX1> -- select via the original rowid
SYS@SANDBOX1>
SYS@SANDBOX1> select pk_col from kso.skew_hcc3 where rowid = 'AAATCBAAIAAF8uSFc9';

    PK_COL
----------
     16367

SYS@SANDBOX1> @mystats
Enter value for name: table fetch continued row

NAME                                                                             VALUE
---------------------------------------------------------------------- ---------------
table fetch continued row                                                          2948

SYS@SANDBOX1> -- Stat is incremented – so definitely a migrated row!
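A quick note on the helpers used above: old_rowid() is not a built-in. A minimal sketch of an old_rowid-style function built on dbms_rowid (a reconstruction for readers who want to follow along, not necessarily the exact script used in this post) could look like this; @mystats is essentially just a filtered query against v$statname and v$mystat.

create or replace function old_rowid (p_rowid in rowid) return varchar2 is
begin
   -- return the old-style file.block.slot notation for a rowid
   return dbms_rowid.rowid_relative_fno(p_rowid) || '.' ||
          dbms_rowid.rowid_block_number(p_rowid) || '.' ||
          dbms_rowid.rowid_row_number(p_rowid);
end;
/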
So the row has definitely been migrated. Now let’s verify that the migrated row is not compressed. We can do this by dumping the block where the newly migrated record resides.
SYS@SANDBOX1> !cat dump_block.sql
@find_trace
alter system dump datafile &fileno block &blockno;

SYS@SANDBOX1> @dump_block

TRACEFILE_NAME
------------------------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/sandbox/SANDBOX1/trace/SANDBOX1_ora_5191.trc

Enter value for fileno: 7
Enter value for blockno: 3171084

System altered.
Now let’s look at the trace file produced in the trace directory. Here is an excerpt from the block dump.
Block header dump:  0x01f0630c
 Object id on Block? Y
 seg/obj: 0x13081  csc: 0x01.1e0574d4  itc: 3  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x1f06300 ver: 0x01 opc: 0
     inc: 0  exflg: 0

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x002f.013.00000004  0x00eec383.01f2.44  ----    1  fsc 0x0000.00000000
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x03   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
bdba: 0x01f0630c
data_block_dump,data header at 0x2b849c81307c
===============
tsiz: 0x1f80
hsiz: 0x14
pbl: 0x2b849c81307c
     76543210
flag=--------
ntab=1
nrow=1
frre=-1
fsbo=0x14
fseo=0x1f60
avsp=0x1f4c
tosp=0x1f4c
0xe:pti[0]  nrow=1  offs=0
0x12:pri[0] offs=0x1f60
block_row_dump:
tab 0, row 0, @0x1f60
tl: 32 fb: --H-FL-- lb: 0x1  cc: 5
col  0: [ 4]  c3 02 40 44
col  1: [ 2]  c1 02
col  2: [10]  61 73 64 64 73 61 64 61 73 64
col  3: [ 7]  78 6a 07 15 15 0b 32
col  4: [ 1]  59
end_of_block_dump
Notice that there is only one row in the block (nrow=1). Also notice that the object_id is included in the block (in hex format). It is labeled “seg/obj:”. The table has 5 columns. The values are displayed – also in hex format. Just to verify that we have the right block and row we can translate the object_id and the value of the first column as follows:
SYS@SANDBOX1> !cat obj_by_hex.sql
col object_name for a30
select owner, object_name, object_type
from dba_objects
where object_id = to_number(replace('&hex_value','0x',''),'XXXXXX');

SYS@SANDBOX1> @obj_by_hex
Enter value for hex_value: 0x13081

OWNER                          OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------------------ -------------------
KSO                            SKEW_HCC3                      TABLE

SYS@SANDBOX1> desc kso.skew_hcc3
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 PK_COL                                 NUMBER
 COL1                                   NUMBER
 COL2                                   VARCHAR2(30)
 COL3                                   DATE
 COL4                                   VARCHAR2(1)

SYS@SANDBOX1> !cat display_raw.sql
col value for a50
select display_raw(replace('&string',' ',''),nvl('&TYPE','VARCHAR2')) Value
from dual
/

SYS@SANDBOX1> @display_raw
Enter value for string: c3 02 40 44
Enter value for type: NUMBER

VALUE
--------------------------------------------------
16367
As you can see, this is the record that we updated earlier in the SKEW_HCC3 table. Note: display_raw.sql depends on a little function called display_raw() which coincidentally I got from Greg Rahn. Here’s a script to create the function: create_display_raw.sql
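For readers who don't want to chase the link, such a function can be built on dbms_stats.convert_raw_value. The sketch below is a reconstruction rather than the linked script, and it only handles the three datatypes that come up in this example:

create or replace function display_raw (p_raw raw, p_type varchar2)
return varchar2
is
   cn number;
   cv varchar2(4000);
   cd date;
begin
   -- convert a raw column value from a block dump back into something readable
   if p_type = 'NUMBER' then
      dbms_stats.convert_raw_value(p_raw, cn);
      return to_char(cn);
   elsif p_type = 'VARCHAR2' then
      dbms_stats.convert_raw_value(p_raw, cv);
      return cv;
   elsif p_type = 'DATE' then
      dbms_stats.convert_raw_value(p_raw, cd);
      return to_char(cd, 'dd-mon-yyyy hh24:mi:ss');
   else
      return 'unsupported type: ' || p_type;
   end if;
end;
/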
Now for a quick look back at the original block (note that in my testing I updated another row in this same block):
===============
tab 0, row 1, @0x32
tl: 5086 fb: --H-F--N lb: 0x3  cc: 1
nrid:  0x0217cb93.0
col  0: [5074]
Compression level: 03 (Archive Low)
 Length of CU row: 5074
kdzhrh: ------PC CBLK: 1  Start Slot: 00
 NUMP: 01
 PNUM: 00 POFF: 5054 PRID: 0x0217cb93.0
CU header:
CU version: 0   CU magic number: 0x4b445a30
CU checksum: 0x982dec03
CU total length: 11403
CU flags: NC-U-CRD-OP
ncols: 5
nrows: 32759
algo: 0
CU decomp length: 7266   len/value length: 945436
row pieces per row: 1
num deleted rows: 2
deleted rows: 22333, 30848,
START_CU:
 00 00 13 d2 1f 01 00 00 00 01 00 00 13 be 02 17 cb 93 00 00 00 4b 44 5a 30
 03 ec 2d 98 00 00 2c 8b eb 06 00 05 7f f7 00 0e 6d 1c 01 00 02 00 00 00 00
So this little excerpt shows that this is an HCC compressed block ("Compression level: 03 (Archive Low)" and the many CU references). The nrows line shows us that the block contains 32759 rows. It also shows that 2 rows have been deleted from the block (num deleted rows). Notice that one of the deleted rows is the one in slot 22333 (sound familiar?). If you look back at the original rowid in the old format (fileno.blockno.slot) you'll see that it is the row we updated. It was "deleted" from this block when it was migrated to the new block. Of course there is still a pointer left behind.
SYS@SANDBOX1> select old_rowid('AAATCBAAIAAF8uSFc9') location from dual;

LOCATION
-----------------------------------------
8.1559442.22333

SYS@SANDBOX1> select old_rowid(rowid) location, a.* from kso.skew_hcc3 a where rowid = 'AAATCBAAIAAF8uSFc9';

LOCATION                 PK_COL       COL1 COL2                           COL3      C
-------------------- ---------- ---------- ------------------------------ --------- -
7.3171084.0               16367          1 asddsadasd                     21-JUL-06 Y
It’s hard to prove a negative, but it does not appear that any records are decompressed other than those that are actually updated. The other rows in the block appear to remain in HCC format.
A recurring question during the Grid Infrastructure and RAC courses I teach is "How do you manage Oracle password files in a clustered environment?". The answer isn't as straightforward as you might think, because there are significant differences between ASM and RAC (==clustered database) environments. Additionally, in recent releases changes were made concerning password file […]
Some time ago I was on a client site which was busy setting up a RAC cluster using Oracle 10g. Although I wasn’t involved with that particular system there were a couple of coffee-breaks where I ended up chatting with the consultant that the client had hired to install the cluster. In one of our breaks he surprised me by making the casual comment: “Of course, you can’t run Statspack on RAC because it locks up the system.”
Now there’s no reason why Statspack should lock up a RAC cluster – in principle. But this was an 8-node cluster and if you set up the automatic job to take snapshots by running the default spauto.sql script all eight nodes could start running every hour on the hour – so they would all be fighting constantly for every block of every table and index in the Statspack schema. (I’m exaggerating for effect, of course, but not by much). You might not notice the global cache contention in a 2-node cluster but eight nodes could, indeed, “lock up the system” at this point.
Ideally, of course, you would like to run all eight snapshots simultaneously so that you get information for the same moment across all the nodes. In practice you have to stagger the snapshots.
The code for taking snapshots for the AWR (automatic workload repository) gets around this problem by using the ‘WF’ lock for cross-instance synchronisation. One instance gets the lock exclusively, and that’s the instance that sets the snapshot time and coordinates the flushing to disc for all the other instances. (The other instances collect their stats at the same moment – but are cued to flush them to disc one after the other)
For Statspack you have to do something manually – and the simplest approach people sometimes take is to run a slightly different version of spauto.sql on each node, so that each node is scheduled to start its snapshot a couple of minutes after the previous one. But there is an alternative that is a little smarter and eliminates the risk of two instances overlapping because of time slippage or a slow snapshot.
Taking the idea from AWR, we can use dbms_lock to ensure that each snapshot starts as soon as possible after the previous one. We simply create a wrapper procedure for Statspack that requests a specified user-defined lock in exclusive mode before calling the snapshot procedure and releases the lock after the snapshot is complete. We can schedule this procedure to run on every node at the start of each hour – which means every node will try to take the lock at the same moment – but only one node will get it and the rest will queue. As soon as a node completes its snapshot and releases the lock, the next node in the queue acquires it and starts its own snapshot.
Here’s some sample code for the wrapper. I’ve included a couple of debug messages, and a piece of code that makes the procedure time out without taking a snapshot if it has to wait for more than 10 minutes. Note that you have to grant execute on dbms_lock to perfstat for this to work.
create or replace procedure rac_statspack (i_snap_level in number)
as
    m_status    number(38);
    m_handle    varchar2(60);
begin

    sys.dbms_lock.allocate_unique(
        lockname   => 'Synchronize Statspack',
        lockhandle => m_handle
    );

    m_status := sys.dbms_lock.request(
        lockhandle        => m_handle,
        lockmode          => dbms_lock.x_mode,
        timeout           => 600,      -- default is dbms_lock.maxwait
        release_on_commit => false     -- which is the default
    );

    if (m_status = 0) then

        dbms_output.put_line(
            to_char(sysdate,'dd hh24:mi:ss') || ': Acquired lock, running statspack'
        );

        statspack.snap(i_snap_level);

        dbms_output.put_line(
            to_char(sysdate,'dd hh24:mi:ss') || ': Snapshot completed'
        );

        m_status := sys.dbms_lock.release(
            lockhandle => m_handle
        );

    else
        dbms_output.put_line(
            to_char(sysdate,'dd hh24:mi:ss') ||
            case m_status
                when 1 then ': Lock wait timed out'
                when 2 then ': deadlock detected'
                when 3 then ': parameter error'
                when 4 then ': already holding lock'
                when 5 then ': illegal lock handle'
                       else ': unknown error'
            end
        );
    end if;

end;
/
And here’s a little bit of pl/sql you can run on each node in turn to install the procedure under dbms_job.
declare
    m_job   number;
    m_inst  number;
    m_date  date;
    m_jqs   number;
begin

    select instance_number into m_inst from v$instance;

    dbms_job.submit(
        job       => m_job,
        what      => 'rac_statspack(7);',
        next_date => trunc(sysdate + 1 / 24,'HH'),
        interval  => 'trunc(SYSDATE + 1 / 24,''HH'')',
        no_parse  => TRUE,
        instance  => m_inst,
        force     => true
    );

    commit;

    select next_date into m_date from dba_jobs where job = m_job;
    select value into m_jqs from v$parameter where name = 'job_queue_processes';

    dbms_output.put_line('Job number: '         || m_job);
    dbms_output.put_line('Next run time: '      || to_char(m_date,'dd-Mon-yyyy hh24:mi:ss'));
    dbms_output.put_line('Current Job Queues: ' || m_jqs || ' (must be greater than zero)');

end;
/
Warning: Judging by the date stamps on my files it’s at least 18 months since I last took this approach with a system – so (a) you might want to test it carefully before you use it and (b) you might want to modify the code to use dbms_scheduler to run the job rather than dbms_job.
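If you do go the dbms_scheduler route, a minimal sketch of an equivalent job might look like the following. The job name and repeat interval are assumptions rather than part of the original code, and, as with the dbms_job version, you would create one such job per instance (on 11g you could pin each job to its instance by setting the job's instance_id attribute):

begin
    dbms_scheduler.create_job(
        job_name        => 'RAC_STATSPACK_JOB',          -- name is an assumption
        job_type        => 'PLSQL_BLOCK',
        job_action      => 'begin rac_statspack(7); end;',
        start_date      => trunc(sysdate + 1/24,'HH'),   -- top of the next hour
        repeat_interval => 'FREQ=HOURLY; BYMINUTE=0; BYSECOND=0',
        enabled         => true
    );
end;
/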
Note: This discussion is potentially relevant only to OEL 5.x and RHEL 5.x – I haven't been able to verify that it works the same way on other Linux distributions, though I would assume so. Before starting with the article, here are some facts:
The question I have asked myself many times is: how can I dynamically add a LUN to ASM without having to stop any component of the stack? Having mocked "reboot-me" operating systems like Windows, I always went rather quiet when it came to discussing the addition of a LUN to ASM on Linux. Today I learned how to do it, by piecing together information I got from Angus Thomas, a great Red Hat system administrator I had the pleasure to work with in 2009 and 2010. And since I have a short-lived memory, I decided to write it down.
I'll describe the process from top to bottom, from the addition of the LUN to the server all the way up to the addition of the ASM disk to the disk group.
Adding the storage to the cluster nodes
The first step, obviously, is to get the LUN assigned to the server(s). This is the easy part, and it is outside the control of the Linux/Oracle admin: the storage team will provision a new LUN to the hosts in question. At this stage Linux has no idea about the new storage; to make it available, the system administrator has to rescan the SCSI bus. A proven and tested way to do this in RHEL 5 is to issue this command:
[root@node1 ~]# for i in `ls -1 /sys/class/scsi_host`; do
> echo "- - -" > /sys/class/scsi_host/${i}/scan
> done
The new, unpartitioned LUN will appear in /proc/partitions. If it doesn't, there's probably something wrong on the SAN side – check /var/log/messages and talk to your storage administrator. If it's not a misconfiguration, you may have no option but to reboot the node.
Configure Multipathing
So far so good; the next step is to add the new device to the multipath configuration. First of all, you need to find out the WWID of the new device. In my case that's simple: the last new entry in /proc/partitions is usually a giveaway. If you are unsure, ask whoever has console access to the array to check the WWID for you. It's important to get this right at this stage :)
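If you want to double-check from the host, the RHEL 5 version of scsi_id can report the WWID of one of the newly discovered paths directly. A minimal sketch, with a placeholder device name:

# report the WWID of one of the newly discovered paths (sdj is a placeholder
# for whichever sd device the rescan produced); on RHEL 5 the -s argument is
# a sysfs path relative to /sys
[root@node1 ~]# /sbin/scsi_id -g -u -s /block/sdj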
To add the new disk to the multipath.conf file, all you need to do is to add a new section, as in the following example:
multipaths {
    ..
    multipath {
        wwid                 360000970000294900664533030344239
        alias                ACFS0001
        path_grouping_policy failover
    }
    ..
}
By the way, I have covered configuring multipathing in more detail in a previous blog post here. Don't forget to replicate the changes to the other cluster nodes!
Now you reload multipathd using /etc/init.d/multipathd reload on each node, and voila, you should see the device in /dev/mapper/ – my ACFS disk appeared as /dev/mapper/ACFS0001.
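Spelled out, the reload and a quick sanity check might look like this on each node (a sketch using the standard device-mapper-multipath tooling):

[root@node1 ~]# /etc/init.d/multipathd reload
[root@node1 ~]# multipath -ll ACFS0001        # confirm the alias, WWID and paths
[root@node1 ~]# ls -l /dev/mapper/ACFS0001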
Now the tricky bit: partitioning it, if you need to – partitioning is no longer mandatory for ASM with 11.1 and newer, but some software, like EMC's Replication Manager, still requires it. I succeeded by locating the device in /dev/disk/by-id and then running fdisk against it, as in this example:
...
# fdisk /dev/disk/by-id/scsi-360000970000294900664533030344239
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 23251.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (32-47619839, default 32): 128
Last sector or +size or +sizeM or +sizeK (129-47619839, default 47619839):
Using default value 47619839

Command (m for help): p

Disk /dev/disk/by-id/scsi-360000970000294900664533030344239: 24.3 GB, 24381358080 bytes
64 heads, 32 sectors/track, 23251 cylinders, total 47619840 sectors
Units = sectors of 1 * 512 = 512 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/disk/by-id/scsi-360000970000294900664533030344239p1     128    47619839    23809855+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
Once you are in fdisk, the commands are identical to those for single-pathed storage. Type "n" to create a new partition, "p" for a primary partition, and specify the start and end cylinders as needed. Type "p" to print the partition table and, if you are happy with it, "w" to write it. You might wonder why I added an offset and changed the display units ("u") – this is due to the EMC storage this site uses. The EMC® Host Connectivity Guide for Linux (P/N 300-003-865 REV A23) suggests a 64k offset. Don't simply repeat this in your environment – check with your storage team first.
Before adding the partitions to ACFS0001 and ACFS0002 I had 107 partitions:
[root@node1 ~]# wc -l /proc/partitions
107 /proc/partitions
Here is /proc/partitions again after the two fdisk runs completed:
[root@node1 ~]# wc -l /proc/partitions
107 /proc/partitions
But when you check /dev/mapper now, you still don't see the new partitions – the naming convention is to append pN to the device name, i.e. /dev/mapper/ACFS0001p1 for the first partition on ACFS0001, and so on.
kpartx to the rescue! This superb utility reads the partition table of a device and creates (or removes) the corresponding device-mapper entries for its partitions. Initially my setup was as follows:
[root@node1 ~]# ls -l /dev/mapper/ACFS*
brw-rw---- 1 root disk 253, 31 Jan 18 10:05 /dev/mapper/ACFS0001
brw-rw---- 1 root disk 253, 32 Jan 18 10:05 /dev/mapper/ACFS0002
Usually I would have rebooted the node at this stage, as I didn't know how to get the new partition mappings created. But with kpartx ("yum install kpartx" to install it) this is no longer needed. Consider the example below:
[root@node1 ~]# kpartx -l /dev/mapper/ACFS0001
ACFS0001p1 : 0 47619711 /dev/mapper/ACFS0001 129
[root@node1 ~]# kpartx -a /dev/mapper/ACFS0001
[root@node1 ~]# kpartx -l /dev/mapper/ACFS0002
ACFS0002p1 : 0 47619711 /dev/mapper/ACFS0002 129
[root@node1 ~]# kpartx -a /dev/mapper/ACFS0002
[root@node1 ~]# ls -l /dev/mapper/ACFS000*
brw-rw---- 1 root disk 253, 31 Jan 18 10:05 /dev/mapper/ACFS0001
brw-rw---- 1 root disk 253, 36 Jan 18 10:13 /dev/mapper/ACFS0001p1
brw-rw---- 1 root disk 253, 32 Jan 18 10:05 /dev/mapper/ACFS0002
brw-rw---- 1 root disk 253, 37 Jan 18 10:13 /dev/mapper/ACFS0002p1
"kpartx -l" prints the partition table and "kpartx -a" adds the partition mappings, as the example shows. No more need to reboot! However, as has been pointed out in the comments section (see below), kpartx doesn't add the mappings for both paths, so you should run the partprobe command to add the missing ones:
[root@node1 ~]# partprobe
[root@node1 ~]# wc -l /proc/partitions
109 /proc/partitions
See how there are now 109 partitions listed instead of the 107 from before – the 2 missing paths have been added (one for each device).
Add disks to ASM
With this done, you can add the disks to ASM – I personally like the intermediate step of creating ASMLib disks. Connect to the ASM instance as SYSASM and add the disks using the alter diskgroup command:
SQL> alter diskgroup ACFSDG add disk 'ORCL:ACFS0002', 'ORCL:ACFS0001';
Now just wait for the rebalance operation to complete.
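If you take the ASMLib route mentioned above, the disk stamping might look roughly like this. Treat it as a sketch rather than a definitive procedure: the disk names come from the example above, createdisk is run on one node only, and the other nodes pick the disks up with a scandisks:

[root@node1 ~]# /usr/sbin/oracleasm createdisk ACFS0001 /dev/mapper/ACFS0001p1
[root@node1 ~]# /usr/sbin/oracleasm createdisk ACFS0002 /dev/mapper/ACFS0002p1
[root@node2 ~]# /usr/sbin/oracleasm scandisks
[root@node2 ~]# /usr/sbin/oracleasm listdisks

The progress of the rebalance can be watched from the ASM instance:

SQL> select group_number, operation, state, power, est_minutes from v$asm_operation;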
I have two Oracle Index Internals and Best Practices seminars scheduled in the coming months, in Austria (Vienna) and Estonia (Tallinn). These could very well be my last seminars for quite a while as I’m unsure whether I’ll be able to travel again this year, so this could be your last opportunity to see me [...]
Compressing tables with Oracle Advanced Compression also has a significant impact on Parallel Query performance. From the Oracle Advanced Compression White Paper:
For large-scale DWH systems, reducing the amount of data read from physical disk is the key to improving performance. The extent of that impact can be seen in the TPC-H benchmark results (from http://ichiroobata.blogspot.com/2010/08/partitioning.html).
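As a minimal illustration of the feature under discussion (not taken from the white paper, and with hypothetical object names), OLTP compression can be enabled on an existing 11.2 table roughly like this:

SQL> -- rebuild the table as OLTP-compressed (11.2 syntax; SALES is a hypothetical table)
SQL> alter table sales move compress for oltp;
SQL> -- a move leaves dependent indexes unusable, so rebuild them afterwards
SQL> alter index sales_pk rebuild;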