The (M)other (O)f (A)ll (T)uning (S)cripts. A tuning and diagnostics utility for sqlplus co-developed with Tanel Poder. Includes active session sampling and a TOP-style utility for displaying instance activity at regular refresh intervals. Supports versions from 10g Release 2 onwards. September 2010
So, here I was, merrily enjoying OpenWorld 2010 presentations in SFO, when I got a call from a client about a performance issue. The client had recently upgraded from version 9i to version 10g in an E-Business Suite environment. I had the privilege of consulting before the upgrade, so we set up the environment optimally, and the upgrade itself was seamless. The client did not see much regression except for one query: that query was running for hours in 10g compared to 15 minutes in 9i.
Review and Analysis
I reviewed the execution plan in the development database and did not see any issues with it; the plans in development and production both looked decent enough. I wasn't able to reproduce the issue in the development database either, so the client allowed me to trace the SQL statement in the production database. Since the data volume in a few tables differs between the production and development databases, we had to analyze the problem in the production environment.
I had to collect as much data as possible, since the tracing was a one-time opportunity. In addition to tracing the process with waits => true, I set up a small script to capture the process stack and process memory map of that Unix dedicated server process to collect more details.
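For reference, here is a minimal sketch of how such a trace can be switched on from another session in 10g using dbms_monitor; the sid and serial# values below are placeholders, and the post does not show the exact call that was used.

-- Hedged sketch: enable extended SQL trace (with wait events) for the
-- dedicated server process; 1234 and 56789 stand in for the real sid/serial#.
begin
  dbms_monitor.session_trace_enable(
    session_id => 1234,
    serial_num => 56789,
    waits      => true,
    binds      => false
  );
end;
/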
The execution plan from the production database is printed below. [Review the execution plan carefully; it gives away the problem immediately.] One execution of this statement took 13,445 seconds of elapsed time, and almost all of it was spent on CPU. Why would the process consume 13,719 seconds of CPU time? The same statement completed in just 15 minutes in 9i, as confirmed by Statspack reports. [As a side note, we collected an enormous amount of performance data in 9i in the production environment before upgrading to 10g, just so that we could quickly resolve any performance issues; you should probably follow that guideline too.] That collection came in handy, and it is clear that the SQL statement was completing in 15 minutes in 9i and took nearly 3.75 hours after the database was upgraded to version 10g.
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch       10  13719.71   13445.94         27    5086407          0       99938
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       12  13719.71   13445.94         27    5086407          0       99938

     24  HASH GROUP BY (cr=4904031 pr=27 pw=0 time=13240600266 us)
     24   NESTED LOOPS OUTER (cr=4904031 pr=27 pw=0 time=136204709 us)
     24    NESTED LOOPS (cr=4903935 pr=27 pw=0 time=133347961 us)
 489983     NESTED LOOPS (cr=3432044 pr=27 pw=0 time=104239982 us)
 489983      NESTED LOOPS (cr=2452078 pr=27 pw=0 time=91156653 us)
 489983       TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=1472112 pr=27 pw=0 time=70907109 us)
 489983        INDEX RANGE SCAN HR_LOCATIONS_UK2 (cr=981232 pr=0 pw=0 time=54338789 us)(object id 43397)
 489983       INDEX UNIQUE SCAN MTL_PARAMETERS_U1 (cr=979966 pr=0 pw=0 time=17972426 us)(object id 37657)
 489983      INDEX UNIQUE SCAN HR_ORGANIZATION_UNITS_PK (cr=979966 pr=0 pw=0 time=10876601 us)(object id 43498)
     24     INDEX RANGE SCAN UXPP_FA_LOCATIONS_N3 (cr=1471891 pr=0 pw=0 time=27325172 us)(object id 316461)
     24    TABLE ACCESS BY INDEX ROWID PER_ALL_PEOPLE_F (cr=96 pr=0 pw=0 time=2191 us)
     24     INDEX RANGE SCAN PER_PEOPLE_F_PK (cr=72 pr=0 pw=0 time=1543 us)(object id 44403)
pstack, pmap, and truss
Reviewing the pstack output generated from the script shows many calls to functions such as kghfrempty, kghfrempty_ex and qerghFreeHashTable, implying hash table operations. Something to do with the hash table is consuming the time?
( Only partial entries shown )
0000000103f41528 kghfrempty
0000000103f466ec kghfrempty_ex
0000000103191f1c qerghFreeHashTable
000000010318e080 qerghFetch
00000001030b1b3c qerstFetch
...
0000000103f41558 kghfrempty
0000000103f466ec kghfrempty_ex
0000000103191f1c qerghFreeHashTable
000000010318e080 qerghFetch
00000001030b1b3c qerstFetch
Truss output for the process also showed quite a few mmap calls. So, the process is allocating more memory for a hash table?
...
mmap(0xFFFFFFFF231C0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xFFFFFFFF231C0000
...
pollsys(0xFFFFFFFF7FFF7EC8, 1, 0xFFFFFFFF7FFF7E00, 0x00000000) = 0
mmap(0xFFFFFFFF231D0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0xFFFFFFFF231D0000
...
Execution plan again ..
Reviewing the execution plan again showed an interesting issue. I am going to post only the two relevant lines from the execution plan below. As you can see, the elapsed time at the NESTED LOOPS OUTER step is 136 seconds, but the elapsed time at the parent HASH GROUP BY step is 13,240 seconds, meaning nearly 13,100 seconds were spent in the HASH GROUP BY step alone! Why would the process spend 13,100 seconds in a GROUP BY operation? The rest of the SQL execution took only 136 seconds, yet the group by operation took 13,100 seconds. That doesn't make sense, does it?
     24  HASH GROUP BY (cr=4904031 pr=27 pw=0 time=13240600266 us)
     24   NESTED LOOPS OUTER (cr=4904031 pr=27 pw=0 time=136204709 us)
...
OFE = 9i
Knowing that the time was spent in the GROUP BY operation and that the new 10g hash grouping method was in use, I decided to test this SQL statement with the 9i optimizer. The SQL completed in 908 seconds with OFE (optimizer_features_enable) set to 9.2.0.8 (the data is a little different since production is an active environment). You can also see that a SORT technique is now used to group the data.
alter session set optimizer_features_enable = '9.2.0.8';

Explain plan:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch   106985    887.41     908.25     282379    3344916        158     1604754
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   106987    887.41     908.25     282379    3344916        158     1604754

      4  SORT GROUP BY (cr=2863428 pr=0 pw=0 time=37934456 us)
      4   NESTED LOOPS OUTER (cr=2863428 pr=0 pw=0 time=34902519 us)
      4    NESTED LOOPS (cr=2863412 pr=0 pw=0 time=34198726 us)
 286067     NESTED LOOPS (cr=2003916 pr=0 pw=0 time=24285794 us)
 286067      NESTED LOOPS (cr=1431782 pr=0 pw=0 time=19288024 us)
 286067       TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=859648 pr=0 pw=0 time=13568456 us)
 286067        INDEX RANGE SCAN HR_LOCATIONS_UK2 (cr=572969 pr=0 pw=0 time=9271380 us)(object id 43397)
 286067       INDEX UNIQUE SCAN MTL_PARAMETERS_U1 (cr=572134 pr=0 pw=0 time=4663154 us)(object id 37657)
...
Knowing the problem was in the GROUP BY step, we set up a SQL profile with _gby_hash_aggregation_enabled set to FALSE, disabling the new 10g feature for that SQL statement only. With the SQL profile in place, performance of the SQL statement was comparable to the pre-upgrade timing.
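For readers who want to try something similar, here is a hedged sketch of one commonly used (though undocumented) way to attach such a parameter hint to a single statement via dbms_sqltune.import_sql_profile; the profile name and the &sql_id substitution are my own illustration, not necessarily the exact mechanism used at this client.

-- Hypothetical sketch: wrap an OPT_PARAM hint in a SQL profile so that only
-- this statement runs with hash group by disabled.
declare
  l_sql_text clob;
begin
  select sql_fulltext
  into   l_sql_text
  from   v$sqlarea
  where  sql_id = '&sql_id';          -- sql_id of the problem statement

  dbms_sqltune.import_sql_profile(
    sql_text => l_sql_text,
    profile  => sys.sqlprof_attr(
                  q'[OPT_PARAM('_gby_hash_aggregation_enabled' 'false')]'
                ),
    name     => 'DISABLE_HASH_GBY_PROFILE'   -- invented name
  );
end;
/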
This almost sounds like a bug! Bug 8223928 matches this stack, but it describes the opposite behaviour. Well, the client will work with Oracle Support to get a fix for this issue.
Summary
In summary, you can use scientific methods to debug performance issues. Revealing the details underneath will enable you to come up with a workaround quickly, leading to a faster resolution.
Note that I am not saying the hash group by feature is bad. Rather, we seem to have encountered an unfortunate bug which caused performance issues at this client. I think hash grouping is a good feature, as the efficiency of grouping operations can be improved if you have an ample amount of memory. That's the reason why we disabled this feature at the statement level, NOT at the instance level.
This blog post in a traditional format: hash_group_by_orainternals
Update 1:
I am adding a script to capture pmap and pstack output in a loop, 1000 times with a 10-second interval. Tested on Oracle Solaris.
#! /bin/ksh
pid=$1
(( cnt=1000 ))
while [[ $cnt -gt 0 ]];
do
  date
  pmap -x $pid
  pstack $pid
  echo $cnt
  (( cnt=cnt-1 ))
  sleep 10
done
To call the script, assuming 7887 is the UNIX pid of the process:
nohup ./pmap_loop.ksh 7887 >> /tmp/a1.lst 2>>/tmp/a1.lst &
Syntax for the truss command is given below. Please remember, you can't use pmap, pstack and truss concurrently. These commands stop the process (however briefly!) and inspect it, so use them sparingly. [I had a client who used to run truss against the LGWR process on a continuous(!) basis, and the database used to crash randomly!] I realize that pmap/pstack/truss can be scripted to work together, but that would involve submitting a background process for the truss command and killing that process after a small timeout window. That would be a risky approach in a production environment, so I prefer to run the truss command manually and CTRL+C it after a few seconds.
truss -d -E -o /tmp/truss.lst -p 7887
I cannot stress enough that you should not overuse these commands in a production environment. strace (Linux) and tusc (HP-UX) are the commands comparable to truss (Solaris).
Here’s a hidden threat in the optimizer strategy that may cause performance problems if you’re trying to operate a series of batch updates (or batch deletes).
In the past I’ve pointed out that a predicate like “rownum <= N” generally makes the optimizer use “first_rows(N)” optimisation methods – known in the code as first_k_rows optimisation.
This isn’t true for updates and deletes, as the following simple example indicates:
create table t1
as
with generator as (
    select  --+ materialize
        rownum  id
    from    dual
    connect by rownum <= 10000
)
select
    rownum              id,
    lpad(rownum,10,'0') small_vc,
    rpad('x',100)       padding
from
    generator   v1,
    generator   v2
where
    rownum <= 10000
;

create index t1_i1 on t1(id);

-- gather_table_stats, no histograms, compute, cascade

explain plan for
update t1
set     small_vc = upper(small_vc)
where   id > 100
and     rownum <= 200
;

select * from table(dbms_xplan.display);

explain plan for
select  small_vc
from    t1
where   id > 100
and     rownum <= 200
;

select * from table(dbms_xplan.display);
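The commented-out line in the script stands in for a statistics call along the following lines; this is a hedged reconstruction of what “no histograms, compute, cascade” typically means, not necessarily the author’s exact call.

begin
  dbms_stats.gather_table_stats(
    ownname          => user,
    tabname          => 'T1',
    estimate_percent => null,                      -- compute
    method_opt       => 'for all columns size 1',  -- no histograms
    cascade          => true
  );
end;
/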
As usual I ran this with system statistics (CPU costing) disabled, using a locally managed tablespace with uniform 1MB extents and freelist management – simply because this leads to a repeatable test. Since I was running 11.1.0.6 I didn’t set the db_file_multiblock_read_count parameter (thus allowing the _db_file_optimizer_read_count to default to 8). These are the plans I got for the update and select respectively:
------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost  |
------------------------------------------------------------
|   0 | UPDATE STATEMENT     |      |   200 |  3000 |    27 |
|   1 |  UPDATE              | T1   |       |       |       |
|*  2 |   COUNT STOPKEY      |      |       |       |       |
|*  3 |    TABLE ACCESS FULL | T1   |  9901 |   145K|    27 |
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=200)
   3 - filter("ID">100)

----------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost  |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |   200 |  3000 |     6 |
|*  1 |  COUNT STOPKEY               |       |       |       |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |   200 |  3000 |     6 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |       |       |     2 |
----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=200)
   3 - access("ID">100)
Note how the select statement uses an index range scan with stop key as the best strategy for finding 200 rows and then stopping – and the total cost of 6 is the cost of visiting the (well-clustered) table data for two hundred rows. The update statement uses a full tablescan to find the first 200 rows with a total cost of 27 – which happens to be the cost of a completed tablescan, not the cost of “enough of the tablescan to find 200 rows”. The update statement has NOT been optimized using the first_k_rows strategy – it has used the all_rows strategy.
The demonstration is just a starting-point of course – you need to do several more checks and tests to convince yourself that first_k_rows optimisation isn’t going to appear for updates (and deletes) and to discover why it can be a problem that needs to be addressed. One of the simplest checks is to look at the 10053 (CBO) trace files to see the critical difference, especially to notice what’s in the trace for the select but missing from the trace for the update. The critical lines show the following type of information – but only in the trace file for the select:
First K Rows: K = 200.00, N = 9901.00
First K Rows: Setup end

First K Rows: K = 200.00, N = 9901.00
First K Rows: old pf = -1.0000000, new pf = 0.0202000
SINGLE TABLE ACCESS PATH (First K Rows)

First K Rows: unchanged join prefix len = 1

Final cost for query block SEL$1 (#0) - First K Rows Plan:
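For completeness, a 10053 trace like the one quoted above can be produced with the standard event syntax shown below; this is the usual approach rather than necessarily the exact commands used for this test.

-- Generate a CBO (10053) trace for the select; the trace file appears in
-- the session's trace directory (user_dump_dest).
alter session set events '10053 trace name context forever, level 1';

explain plan for
select small_vc from t1 where id > 100 and rownum <= 200;

alter session set events '10053 trace name context off';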
But why might it matter anyway? Here’s the shape of a piece of SQL, embedded in pl/sql, that I found recently at a client site:
update tabX set
    col1 = {constant}
where
    col2 in (
        complex subquery
    )
and {list of other predicates}
and rownum <= 200
returning
    id
into
    :bind_array
;
For most of the calls to this SQL there would be a small number of rows ready for update, and the pl/sql calling this update statement would populate an array (note the “returning” clause) with the ids for the rows updated and then do something with those ids. Unfortunately there were occasions when the data (and the statistics about the data) covered tens of thousands of rows that needed the update. When this happened the optimizer chose to unnest the complex subquery – instead of using a very precise and efficient filter subquery approach – and do a massive hash semi-join that took a couple of CPU minutes per 200 rows and hammered the system to death for a couple of hours.
If Oracle had followed the first_k_rows optimizer strategy it would have used the “small data” access path and taken much less time to complete the task. As it was we ended up using hints to force the desired access path – in this case it was sufficient to add a /*+ no_unnest */ hint to the subquery.
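To make the hint placement concrete, here is a hedged sketch using invented table and column names (tabX, col1, col2, id, lookup_tab); the only change to the shape shown earlier is the no_unnest hint inside the subquery, which keeps it as a filter subquery instead of allowing it to be unnested into a hash semi-join.

update tabX
set    col1 = 'PROCESSED'
where  col2 in (
           select /*+ no_unnest */ l.col2    -- hint goes inside the subquery
           from   lookup_tab l
           where  l.status = 'READY'
       )
and    rownum <= 200
returning id into :bind_array
;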
Today’s email brought a loaf of spam with this in it:
[Name withheld] is a world-class developer and provider of leading-edge solutions that help customers optimize the physical infrastructure through simplification, agility, and operational efficiency.
This passage is the informational equivalent of this audio file. If you can read it without feeling sad, sarcastic, vaguely scummy, or bitter about humanity’s perverse unwillingness to combine thought and language in a useful way, then I beg you to read Revising Prose and Rework.
Please.
That is all.
I arrived home in Chicago around 1am Saturday morning on a slightly delayed flight direct from San Francisco. What a week – I’m only now getting back into my normal routine!
It’s nothing spectacular, but I wrote this short Haiku (poem) on Sunday…
OpenWorld: crush, splat…
Brain worked overtime last week!
Still catching up sleep.
Haha… ok… so I think that I’m finally caught up on sleep now that it’s Thursday. But it was a busy week!
OpenWorld was worth the plane ticket, for both the learning and the networking opportunities. My two favorite sessions were at Oracle Closed World and the Unconference, respectively… but there were plenty of great “official” OpenWorld sessions too!
In my last blog entry I alluded to perhaps not being all that happy about Fibre Channel. Well, it’s true. I have been having a love/hate relationship with Fibre Channel for the last ten years or so, and we have now decided to get a divorce. I just can’t stand it any more!
I first fell in love with Fibre Channel in the late 90s: how could I resist the prospect of leaving behind multi-initiator SCSI with all its deep, deep electrical issues? Fibre Channel let me hook up multiple hosts to lots of drives, via a switch, and it let me dynamically attach and detach devices from multiple clustered nodes without reboots. Or so I thought. The reality of Fibre Channel is that it was indeed a revelation in its day, but some of that promise never really materialised until recently. And now it’s too late.
It’s an interactive Q&A webinar with Dave Welch, the main guy at VMware’s booth at Oracle OpenWorld last week. Dave is pro-VMware, as one might imagine.
If you have any burning questions about VMware and Oracle and/or positive/negative experiences with Oracle on VMware, drop me a line and I’ll try and get your content, questions and feedback into the webinar. (during the webinar questions can be posted in the online chat)
The current list of questions centers around performance overhead on VMware, Oracle’s attitude toward supporting VMware, whether RAC is appropriate and/or supported on VMware, and how VMware compares to Oracle VM Server.
Here’s a little trick I’ve only just discovered (by accident). What’s a tidy way to extract the fractional part from a number?
Running 11.1.0.6 (I haven’t checked earlier versions yet – or looked at the manuals for the definition of the mod() function):
SQL> select mod(1.34,1) from dual;

MOD(1.34,1)
-----------
        .34

1 row selected.

SQL> select mod(-1.34,1) from dual;

MOD(-1.34,1)
------------
       -.34
I really didn’t think it would make sense to use an expression like mod(p,1) – but it’s a pleasant surprise that it does what it does.
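For comparison, here is a quick sketch of the more familiar idiom; as far as I can tell both expressions behave the same way, including preserving the sign of the fractional part.

-- The usual alternative: subtract the truncated value.
select
        1.34  - trunc(1.34)     pos_fraction,   -- .34
        -1.34 - trunc(-1.34)    neg_fraction    -- -.34
from    dual;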
Finally time for a new series! With the arrival of the new 11.2.0.2 patch set I thought it was about time to try to set up a virtual 11.2.0.2 extended distance or stretched RAC. So, it’s virtual, fair enough. It doesn’t allow me to test things like the impact of latency on the inter-SAN communication, but it does allow me to test the general setup. Think of this series as a guide written after all the tedious work has been done and the SANs happily talk to each other. The example requires some understanding of how Xen virtualisation works, and it’s tailored to openSuSE 11.2 as the dom0 or “host”. I have tried Oracle VM in the past, but back then a domU (or virtual machine) could not mount an iSCSI target without a kernel panic and reboot. Clearly not what I needed at the time. openSuSE has another advantage: it uses a recent kernel, not the three-year-old 2.6.18 you find in Enterprise distributions. Also, Xen is recent (openSuSE 11.3 even features Xen 4.0!) and so is libvirt.
The Setup
The general idea follows the design you find in the field, but with fewer cluster nodes. I am thinking of two nodes for the cluster and two iSCSI target providers. I wouldn’t use iSCSI in the real world, but my lab isn’t connected to an EVA or similar array. A third site will provide quorum via an NFS-provided voting disk.
Site A will consist of filer01 for the storage part and edcnode1 as the RAC node. Site B will consist of filer02 and edcnode2. The iSCSI targets are going to be provided by openFiler domU installations, and the cluster nodes will make use of Oracle Enterprise Linux 5 update 5. To make it more realistic, site C will consist of another openFiler instance, filer03, to provide the NFS export for the third voting disk. Note that openFiler seems to support NFS v3 only at the time of this writing. All systems are 64-bit.
The network connectivity will go through 3 virtual switches, all “host only” on my dom0.
As in the real world, the private and storage networks have to be separated to prevent iSCSI packets clashing with Cache Fusion traffic. Also, I increased the MTU for the private and storage networks to 9000 instead of the default 1500. If you’d like to use jumbo frames you should check whether your switch supports them.
Grid Infrastructure will use ASM to store OCR and voting disks, and the inter-SAN replication will also be performed by ASM in normal redundancy. I am planning on using preferred mirror read and intelligent data placement to see if that makes a difference.
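As a pointer, preferred mirror read is controlled by an initialisation parameter set on each ASM instance; the sketch below uses invented disk group and failure group names (DATA, SITEA, SITEB), so treat it as an illustration rather than the exact configuration of this lab.

-- Each ASM instance reads from the mirror copy in its local failure group.
alter system set asm_preferred_read_failure_groups = 'DATA.SITEA' sid = '+ASM1';
alter system set asm_preferred_read_failure_groups = 'DATA.SITEB' sid = '+ASM2';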
Known limitations
This setup has some limitations, such as the following:
So much for the introduction; I’ll post the setup step by step. The intended series will consist of these articles:
That’s it for today. I hope I got you interested in following the series. It’s been real fun doing it; now it’s about writing it all up.