Search

Top 60 Oracle Blogs

Recent comments

DBA

When is a Health Check not a Health Check

Thanks to The Human Fly via Twitter @sjaffarhussain I see that Oracle Corporation have a published note on How to Perform a Database Health Check. (Note 122669.1). I read this with some interest as this is something that I do quite frequently as part of my day job. (If you’d like to get me to [...]

Handling Human Errors

Interesting question on human mistakes was posted on the DBA Managers Forum discussions today.

As human beings, we are sometimes make mistakes. How do you make sure that your employees won’t make mistakes and cause downtime/data loss/etc on your critical production systems?

I don’t think we can avoid this technically, probably working procedures is the solution.
I’d like to hear your thoughts.

I typed my thoughts and as I was finishing, I thought that it makes sense to post it on the blog too so here we go…

The keys to prevent mistakes are low stress levels, clear communications and established processes. Not a complete list but I think these are the top things to reduce the number of mistakes we make managing data infrastructure or for that matter working in any critical environment be it IT administration, aviation engineering or medical surgery field. It’s also a matter of personality fit – depending on your balance between mistakes tolerance and agility required, you will favor hiring one individual or another.

Regardless of how much you try, there are still going to be human errors and you have to account for them in the infrastructure design and processes. The real disasters happen when many things align like several failure combined with few human mistakes. The challenge is to find the right balance between efforts invested in making no mistakes and efforts invested into making your environment errors-proof to the point when risk or human mistake is acceptable to the business.

Those are the general ideas.

Just a few examples of the practical solutions to prevent mistakes when it comes to Oracle DBA:

  • test production actions on a test system before applying in production
  • have a policy to review every production change by another senior member of a team
  • watch over my shoulder policy working on production environments – i.e. second pair of eye all the time
  • employee training, database recovery bootcamp
  • discipline of performing routing work under non-privileged accounts

Some of the items to limit impact of the mistakes:

  • multiples database controlfiles for Oracle database (in case DBA manually does something bad to one of them – I saw this happen)
  • standby database with delayed recovery or flashback database (for Oracle)
  • no SPOF architecture
  • Oracle RAC, MySQL high availability setup (like sharding or replication), SQL*Server cluster — architecture examples that limit impact of human mistakes affecting a single hardware component

Both lists can go on very long. Old article authored by Paul Vallee is very relevant top this topic — The Seven Deadly Habits of a DBA…and how to cure them.

Feel free to post your thoughts and example. How do you approach human mistakes in managing production data infrastructure?

OOW09 Session#4 DBA 11g New Features

For all those who came to my last of my four sessions - 11g New Features for DBAs - I appreciate your taking the time. It was a pleasant surprise to see about 500 people showing up at a lunch time slot on the last day of the conference.

Here is the presentation link. I hope you enjoyed the session and found it useful.

OOW'08 Oracle 11g New Features for DBAs

It was my second session at Open World this year. It was full with 332 attendees with a whopping 277 attendees on wait list! the room capacity was 397. Of course, the room did have some fragmentation and not everyone could make it.

Here is the abstract:

There is a world outside the glittering marketing glitz surrounding Oracle 11g. In this session, a DBA and author of the popular 11g New Features series on OTN covers features that stand out in the real world and make your job easier, your actions more efficient and resilient, and so on. Learn the new features with working examples: how to use Database Replay and SQL Performance Analyzer to accurately predict the effect of changes and Recovery Manager (RMAN) Data Recovery Advisor to catch errors and corruption so new stats won't cause issues.

Thank you very much for those who decided to attend. I hope you found it useful. Here is the presentation. You can download it from the Open World site too. Please note, the companion site to see al working examples and a more detailed coverage is still my Oracle 11g New Features Series on Oracle Technology Network.