Search

Top 60 Oracle Blogs

Recent comments

knowledge

Friday Philosophy – I Am An Exadata Expert

(Can I feel the angry fuming and dagger looks coming from certain quarters now?)

I am an Exadata Expert.

I must be! – I have logged onto an Exadata quarter rack and selected sysdate from Dual.

The pity is that, from some of the email threads and conversations I have had with people over the last 12 months, this is more real-world experience than some people I have heard of who are offering consultancy services. It’s also more experience than some people I have actually met, who have extolled their knowledge of Exadata – which is based solely on the presentations by Oracle sales people looking at the data sheets from 10,000 feet up and claiming it will solve world hunger.

Heck, hang the modesty – I am actually an Exadata Guru!

OUGN 2012 Second Day – Out on the Open Seas

As I said yesterday, I am not one for blogging about conferences. So what the heck, I’ll do another post on this one :-) .

You might have picked up on the fact that I am not very good on the sea and have a lot of nervousness about this idea of spending 2 days on an ocean-going liner. So today is the day we move over to being on the water, the OUGN conference proper. I’m delighted to say I am not the only one nervous about this boat lark, Marie Colgan {or, as I should be calling her, “The Optimizer Lady” – since Doug Burns christened her back at the UKOUG 2011 conference} feels the same. There is nothing better to reduce one’s apprehension about something than finding someone else who is just as if not more anxious about it. I suspect this is an evil way to think but I can’t help it.

Headlong rush to Conference – Preparing the Presentations

With only a few days to go before the UKOUG conference this year I’m preparing my presentations. I know pretty much what I want to say and, for the IOT talk at least, it is not as if I do not have enough material already prepared – some of which has been on the blog posts and some of which has not. (though it did strike me that I could just fire up the blog and walk through the thread, taking questions).

My big problem is not what to say – it is what not to say.

Want to Know More about Oracle’s Core?

I had a real treat this summer during my “time off” in that I got to review Jonathan Lewis’s up-coming new book. I think it’s going to be a great book. If you want to know how Oracle actually holds it’s data in memory, how it finds records already in the cache and how it manages to control everything so that all that committing and read consistency really works, it will be the book for you.

{Update, Jonathan has confirmed that, unexpected hiccups aside, Oracle Core: Essential Internals for DBAs and Developers should be available from October 24, 2011}

When to Fix & When to Find Out – Friday Philosophy

{warning, this is one of my long, rambling Friday Philosophy blogs. Technical blogs are on the way – though there are some nice AWR screen shots in this one :-) }

As a DBA (or System Administrator or Network admin or any other role that involves fixing things that are broken before the shouting starts) there is often a tension between two contending “best practices”:

- getting the system working again as soon as possible.
or
- understanding exactly what the problem is.

Some experts point out {and I generally agree} that unless you actually find out exactly what the problem was, what you did to fix it and via a test case demonstrate why the fix worked, you are not really solving the problem – You are just making it go away. You (hopefully!) know what you changed or the action you took so you have something that you can repeat which might fix it again next time. (NB this does not apply to the classic “turn it off, turn it back on again”, that nearly always is just an aversion therapy).

But it might not fix it next time.

Or you might be going to way more trouble to fix it than is needed {I’m remembering how flushing the shared pool used to get touted as a way to resolve performance issues, but it is a pretty brutal thing to do to a production database}.

You might even be making other problems worse {like slowing down everything on the production database as the caches warm up again, having flushed out all that data in the SGA, or that good-old-standard of rebuilding indexes that simply do not benefit from the rebuild}.

There is another issue with “just fixing it by trying things” in that you are not really learning about the cause of the issue or about how the technology you are looking after works. A big part of what makes an “expert” an expert is the drive, desire and opportunity to take the time to work this out. It’s that last part I sometimes get grumpy about, the opportunity.
For many of us, we do not have the luxury of investigating the root cause. Because we are under so much pressure to “get it fixed right now and worry about the detail later”. But we do not get to the detail as there is then the next “fix this problem right now” and so it goes on. {Kellyn Pot’Vin does a nice post about what she calls the Superman Conundrum on this topic}.

I’ve had exactly this dilema just a couple of months ago. Below are the details, it’s a bit long so skip to the end of the post if you like…

I was creating test data and I decided to use parallel processing to speed it up. I created a month’s worth of data with PL/SQL and then decided to copy it with a simple “insert into …select” statement, updating dates and a couple of other incrementing columns as part of the insert, using parallel. The creation of the second month’s data took longer than the PL/SQL code for the first month took. What the!!!??? I pulled up the AWR information and I could see that the problem was (possibly) with inter process communication between the parallel processes, as shown by the PX DEQ CREDIT:send blkd wait event.

The below screenshot shows the overall instance workload, the green is CPU and the Pink is “Other”. Only one SQL statement is responsible for all of this effort, via 5 sessions (four of which are parallel threads) You can see that the issue had been going on for over an hour {oh, and to a certain extent these pretty pictures are pointless – I am not looking for the exact solution now, but having loaded the pictures up to the blog site I am damn well going to pretty-up my blog with them}:

Drilling into that one session shows that the bulk of the waits by far is for PX DEq Credit: Send blkd:

By selecting that wait event, I got the histogram of wait times since the system started {I love those little histograms of wait times}:

Note that these waits are for long periods of time, around 1/10 of a second on average and some are for up to 2 or 4 seconds.

The thing is, I had anticipated this problem and increased my PARALLEL_EXECUTION_MESSAGE_SIZE to 16K from the default of 2K already, as I knew from experience that the default is way to small and has slowed down parallel execution for me before. So why was I seeing poorer performane now than I anticipated? I’m not understanding stuff. So I needed to change one thing and see what impact it has and repeat until I got to the bottom of it.

Except I could not – the next team in line was waiting for this data and they already had massive pressure to deliver. My job, what my employer was giving me money to do, was to fix this problem and move on. So, in this instance I just pragmatically got on with getting the task done as quickly as I could.

I did what we all do {don’t we?} under times of accute time pressure. I made a handful of changes, using the knowledge I already have and guessing a little, hoping that one of them would do the trick. This included reducing the degree of parallelism, adding the /*+ append */ hint (I simply forgot the first time around), pre-allocating the required space to the tablespace, muttering “pleaseopleaseoplease” under my breath….

It worked:

The job ran in less than 20 minutes and used less resource during that time. Well, it waited less anyway.
The wait histograms show lots and lots of shorter duration waits:

The duplication took 20 minutes when the previous attempt had been terminated after 2 hours when other factors forced it to be curtailed. Job done.

But the thing is, the problem was not fixed. I got the task done in a timescale that was required, I satisfied the need of the client, but I was and am not sure exactly why.

If I was a permanent employee I would consider pressing for being allowed to spend some time working this out, as my employer benefits from me extending my knowledge and skills. This is not always a successful strategy :-) {but it does indicate to me that my employer has a Problem and I need to address that}. In any case, as I was a consultant on this job, I was being paid to already know stuff. So it is now down to me, some personal time, the internet and people more knowledgeble than me who I can ask for help to sort this out.

And that was my main point. Often at work you have to get over the issue. Later on, you work out what the issue was. I wish I could get paid for the latter as well as the former. The real blow for me is that I no longer have access to that site and the information. My job was complete and, whether they have really shut down my access or not, I am not allowed to go back into the systems to dig around. I think I now know the issue, but I can’t prove it.