Yesterday I finally posted to the Method R website some of the papers I wrote while I was at Oracle Corporation (1989–1999). You can now find these papers where I am.
When I was uploading my OFA Standard paper, I noticed that today—24 September 2009—is the fourteenth birthday of its publication date. So, even though the original OFA paper was around for a few years before 1995, please join me in celebrating the birthday of the final version of the official OFA Standard document.
Since it’s been nearly ten years since I wrote my book, Scaling Oracle8i, I thought it was about time that I started writing again. I thought I would start with the new-fangled blogging thing, and see where it takes me. Here goes.
As some will know, I run a small consulting company called Scale Abilities, based out of the UK. We get involved in all sorts of fun projects and problems (or are they the same thing?), but one area that I seem to find myself focusing on a lot is storage. Specifically, the performance and architecture of storage in Oracle database environments. In fact I’m doing this so much that, whenever I am writing presentations for conferences these days, it always seems to be the dominant subject at the front of my mind.
One particular common thread has been the effect of latency. This isn’t just a storage issue, of course, as I endeavoured to point out in my Hotsos Symposium 2008 presentation “Latency and Skew”. Latency, as the subtitle of that particular talk said, is a silent killer. Silent, in that it often goes undetected, and the effects of it can kill performance (and still remain undetected). I’m not going to go into all the analogies about latency here, but let’s try and put a simple definition out for it:
Latency is the time taken between a request and a response.
If that’s such a simple definition, why is it so difficult to spot? Surely if a long period of time passes between a request and a response, the latency will be simple to nail? No.
DBAs from time to time must write shell scripts. If your environment is strictly Windows based, this article may hold little interest for you.
Many DBAs, however, rely on shell scripting to manage databases. Even if you use OEM for many tasks, you likely use shell scripts to manage some aspects of DBA work.
Lately I have been writing a number of scripts to manage database statistics: gathering, deleting, importing, and exporting, both to and from statistics tables and exp files.
Years ago I started using the shell builtin getopts to gather arguments from the command line. A typical use might look like the following:
while getopts d:u:s:T:t:n: arg
case $arg in
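The fragment above is cut short. As a rough illustration of the same getopts pattern, here is a minimal sketch that uses the option string from the fragment. The meaning assigned to each option letter, the variable names, and the parse_args function name are all assumptions made for this example; the original script's options are not shown.

```shell
#!/usr/bin/env bash
# Sketch only: the option letters come from the fragment above, but what
# each letter means (and every variable name below) is an assumption.

parse_args() {
    # Reset getopts state so the function can be called more than once.
    local OPTIND=1 arg
    database='' username='' schema='' stats_table='' table_name='' stat_id=''

    # A colon after a letter means that option requires an argument,
    # which getopts places in OPTARG.
    while getopts d:u:s:T:t:n: arg
    do
        case $arg in
            d) database=$OPTARG ;;      # hypothetical: database name
            u) username=$OPTARG ;;      # hypothetical: connect user
            s) schema=$OPTARG ;;        # hypothetical: schema owning the objects
            T) stats_table=$OPTARG ;;   # hypothetical: statistics table
            t) table_name=$OPTARG ;;    # hypothetical: table to process
            n) stat_id=$OPTARG ;;       # hypothetical: statistics identifier
            *) echo "usage: parse_args [-d db] [-u user] [-s schema] [-T stats_table] [-t table] [-n id]" >&2
               return 1 ;;
        esac
    done
    return 0
}
```

Calling parse_args -d orcl -u scott -T stats_tab would leave database, username, and stats_table set and the other variables empty. An unknown option, or an option missing its argument, falls through to the usage message and a non-zero return.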
Jeremiah Wilton’s presentation shows how to use Oracle 11g Real Application Testing to quantify the effect of system and database configuration changes. As an example, he uses Real Application Testing to validate the Automatic Advisor recommendations, and uncovers some interesting results.
Recently, my client deployed a new application and had this intermittent “Deadlock Storm” …
A trace file was sent and I was able to pinpoint the cause of the deadlock and the session that caused it.
The deadlock was a TX enqueue with mode of 4 (S – share) which could be verified by looking at the following lines of the Process State dump:
last wait for 'enq: TX - row lock contention' blocking sess=0x 7000000cb239d60 seq=7849 wait_time=2929705 seconds since wait started=3 name|mode=54580004, usn<<16 | slot=a0028, sequence=283f2
the "enqueue and lock mode" is explained as:
mode=54580004 (see above)
5458 (hex) = TX (ascii)
0004 (hex) = mode 4 (S – share)
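The same decoding can be scripted. The sketch below assumes only what the breakdown above shows: the top two bytes of the name|mode value are the ASCII characters of the enqueue type, and the low two bytes are the lock mode. The function names decode_enqueue and chr are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: split a name|mode hex value from a trace file into the
# enqueue type (two ASCII characters) and the numeric lock mode.

# Print the character for a decimal ASCII code (via an octal escape).
chr() {
    printf "\\$(printf '%03o' "$1")"
}

decode_enqueue() {
    local hex=${1#0x}                        # tolerate an optional 0x prefix
    local value=$((16#$hex))                 # hex string to integer
    local c1=$(( (value >> 24) & 0xFF ))     # first character of the type
    local c2=$(( (value >> 16) & 0xFF ))     # second character of the type
    local mode=$(( value & 0xFFFF ))         # low 16 bits hold the mode
    printf '%s%s mode %d\n' "$(chr "$c1")" "$(chr "$c2")" "$mode"
}

decode_enqueue 54580004    # prints: TX mode 4
```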
My earlier blog about shared pool duration got an offline response from one of my readers:
“So, you say that durations, aka mini-heaps, have been introduced from 10g onwards. I have been using Steve Adams’ script shared_pool_free_lists.sql. Is that not accurate anymore?”
Shared pool free lists
I have great respect for Steve Adams. In many ways, he has been a great virtual mentor, and his insights are remarkable.
Coming back to the question, I have used Steve’s script before and it is applicable prior to Oracle version 9i. In 9i, sub-heaps were introduced. Further, shared pool durations were introduced in Oracle version 10g. So, his script may not be applicable from version 9i onwards. We will probe this further in this blog.
This is the problem with writing anything about internals: they tend to change from version to version, and in many cases our work can become obsolete in future releases (including this blog!).
In version 9i, each sub-heap of the shared pool has its own free list. In versions 10g and 11g, each duration in a sub-heap has its own free list. This is visible through x$ksmsp: the column x$ksmsp.ksmchdur indicates the duration that a chunk belongs to. In 9i, that column always has a value of 1 (at least in my experiments so far). In 10g and 11g (up to 188.8.131.52), there are exactly 4 durations in each sub-heap, and values of ksmchdur range from 1 to 4. Each duration has its own free list.
Richard Feynman defined science as "the belief in the ignorance of experts." Science begins by questioning established ideas. ...Even those ideas promoted by so-called experts.
The value of science that's obvious to everybody is the chance you might discover some valuable truth that nobody else has discovered before. That's the glamorous idea that might motivate you to begin the hard work that science sometimes requires. Science is also valuable to you when you learn that an established idea, no matter how much you may not like it, really is true after all. That second value of science is not as glamorous, but it's just as important. My little prayer with respect to that possibility is, "If an idea I believe is wrong, please let me find out before anybody else does."
Everyone can do science. Not just "scientists"; all of us. But you need to do science "right," or it's not science. Do it right, and you accumulate a little bit of truth. Do it wrong, and you've wasted your time, or worse, you've doomed yourself to waste more of your time in the future, too.
The difference between "right" and "wrong" in science is not some snooty, bureaucratic concept. You don't need a license or a blessing to do science right. You just need to ensure that the cause-effect relationships you choose to believe are actually correct. One of the rules for doing science right is that you measure instead of just asserting your opinion.
Different people have different thresholds of skepticism. Some people believe new ideas, whether they're true or false, with very little persuasion. The people who are persuaded easily to believe false things cannot contribute much useful new knowledge to their communities (irrespective of how much they might publish).
I presented various performance myths in my ‘battle of the nodes’ presentation. One of the myths was that spawning parallel query slaves across multiple RAC instances can cause a major bottleneck in the interconnect. In fact, that myth was a direct result of a lessons-learnt presentation from a client engagement. The client was suffering from performance issues, with enormous global cache waits running to 30+ ms average response time for global cache CR traffic and crippling application performance. Essentially, their data warehouse was running hundreds of parallel queries concurrently, with slaves spawning across the instances of a three-node RAC.
Of course, I had to hide the client details, so I simplified things and used a test case to explain the myth. It looks like either a) my test case is bad, b) I encountered some sort of bug in version 184.108.40.206, or c) I made a mistake in my analysis somewhere. Most likely it is the last one. Greg Rahn questioned that example, and this topic deserves more research to understand it a little further. At this point, I don’t have 220.127.116.11; the database is at 10.2.0.4, so we will test this in 10.2.0.4.
UDP is one of the protocols used for cache fusion traffic in RAC, and it is the Oracle-recommended protocol. For this article, the size of the UDP traffic must be measured. Measuring global cache traffic using AWR reports was not precise enough, so I decided to use a DTrace Toolkit tool, udpsnoop.d, to measure the traffic between RAC nodes. There are two RAC nodes in this setup. You can read more about udpsnoop.d. The tool can be downloaded from the DTrace Toolkit. Output of this script is of the form:
Last week, one of our clients had a sudden slowdown on all of their applications, which run on a two-node RAC environment.
Below is the summary of the setup:
– Server and Storage: SunFire X4200 with LUNs on EMC CX300
– OS: RHEL 4.3 ES
– Oracle 10.2.0.3 (database and clusterware)
– Database Files, Flash Recovery Area, OCR, and Voting disk are located on OCFS2 filesystems
– Application: Forms and Reports (6i and also lower)
As per the DBA, the workload on the database was normal and there were no changes on the RAC nodes and on the applications. Hmm, I can’t really tell because I haven’t really looked into their workload so I don’t have past data to compare.
It’s been a week since I started my new job at Oracle Corporation. I’m a remote worker which means that the first day of work wasn’t the usual event since I just went to my home office and got on a concall with my new manager. After getting connectivity and accounts set up properly, I was able to pretty quickly work through the new hire checklist of forms and mandatory training.
My new Oracle-provided laptop arrived around mid-week and I realized that, at least for now, I’ll have to revert back to using the Windows-based laptop and (hopefully temporarily) put my MacBook Pro on the shelf. Actually, my wife is very excited since she’ll get the MBP to use now and we’ll do the usual “trickle down” to the kids so that the oldest computer in the “fleet” will get ditched.