Search

Top 60 Oracle Blogs

Recent comments

Method R

Performance Optimization with Global Entry. Or Not?

As I entered the 30-minute "U.S. Citizens" queue for immigration back into the U.S. last week, the helpful "queue manager" handed me a brochure. This is a great place to hand me something to read, because I'm captive for the next 30 minutes as I await my turn with the immigration officer at the Passport Control desk. The brochure said "Roll through Customs faster."Ok. I'm listening.Inside the brochure, the first page lays out the main benefits:

  • bypass the passport lines
  • no paper Customs declaration
  • in most major U.S. airports

Well, that's pretty cool. Especially as I'm standing only 5% deep in a queue with a couple hundred people in it. And look, there's a Global Entry kiosk right there with its own special queue, with nobody—nobody!—in it.If I had this Global Entry thing, I'd have a superpower that would enable me to zap past the couple hundred people in front of me, and get out of the Passport Control queue right now. Fantastic.So what does this thing cost? It's right there in the brochure:

  1. Apply online at www.globalentry.gov. There is a non-refundable $100 application fee. Membership is valid for five years. That's $20 a year for the queue-bypassing superpower. Not bad. Still listening.
  2. Schedule an in-person interview. Next, I have to book an appointment to meet someone at the airport for a brief interview.
  3. Complete the interview and enrollment. I give my interview, get my photo taken, have my docs verified, and that's it, I'm done.

So, all in all, it doesn't cost too much: a hundred bucks and probably a couple hours one day next month sometime.What's the benefit of the queue-bypassing superpower? Well, it's clearly going to knock a half-hour off my journey through Passport Control. I immigrate three or four times per year on average, and today's queue is one of the shorter ones I've seen, so that's at least a couple hours per year that I'd save... Wow, that would be spectacular: a couple more hours each year in my family's arms instead of waiting like a lamb at the abattoir to have my passport controlled.But getting me into my family's arms 30 minutes earlier is not really what happens. The problem is a kind of logic that people I meet get hung up in all the time. When you think about subsystem (or resource) optimization, it looks like your latency savings for the subsystem should go straight to your system's bottom line, but that's often not what happens. That's why I really don't care about subsystem optimization; I care about response time. I could say that a thousand times, but my statement is too abstract to really convey what I mean unless you already know what I mean.What really happens in the airport story is this: if I had used Global Entry on my recent arrival, it would have saved me only a minute or two. Not half an hour, not even close.It sounds crazy, doesn't it? How can a service that cuts half an hour off my Passport Control time not get me home at least a half hour earlier?You'll understand once I show you a sequence diagram of my arrival. Here it is (at right). You can click the image to embiggen it, if you need.To read this sequence diagram, start at the top. Time flows downward. This sequence diagram shows two competing scenarios. The multicolored bar on the left-hand side represents the timeline of my actual recent arrival at DFW Airport, without using the Global Entry service. The right-hand timeline is what my arrival would have looked like had I been endowed with the Global Entry superpower.You can see at the very bottom of the timeline on the right that the time I would have saved with Global Entry is minuscule: only a minute or two.The real problem is easy to see in the diagram: Queue for Baggage Claim is the great equalizer in this system. No matter whether I'm a Global Entrant or not, I'm going to get my baggage when the good people outside with the Day-Glo Orange vests send it up to me. My status in the Global Entry system has absolutely no influence over what time that will occur.Once I've gotten my baggage, the Global Entry superpower would have again swung into effect, allowing me to pass through the zero-length queue at the Global Entry kiosk instead of waiting behind two families at the Customs queue. And that's the only net benefit I would have received.Wait: there were only two families in the Customs queue? What about the hundreds of people I was standing behind in the Passport Control queue? Well, many of them were gone already (either they had hand-carry bags only, or their bags had come off earlier than mine). Many others were still awaiting their bags on the Baggage Claim carousel. Because bags trickle out of the baggage claim process, there isn't the huge all-at-once surge of demand at Customs that there is at Passport Control when a plane unloads. So the queues are shorter.At any rate, there were four queues at Customs, and none of them was longer than three or four families. So the benefit of Global Entry—in exchange for the $100 and the time spent doing the interview—for me, this day, would have been only the savings of a couple of minutes.Now, if—if, mind you—I had been able to travel with only carry-on luggage, then Global Entry would have provided me significantly more value. But when I'm returning to the U. S. from abroad, I'm almost never allowed to carry on any bag other than my briefcase. Furthermore, I don't remember ever clearing Passport Control to find my bag waiting for me at Baggage Claim. So the typical benefit to me of enrolling in Global Entry, unfortunately, appears to be only a fraction of the duration required to clear Customs, which in my case is almost always approximately zero.The problem causing the low value (to me) of the Global Entry program is that the Passport Control resource hides the latency of the Baggage Claim resource. No amount of tuning upon the Passport Control resource will affect the timing of the Baggage In Hand milestone; the time at which that milestone occurs is entirely independent of the Passport Control resource. And that milestone—as long as it occurs after I queue for Baggage Claim—is a direct determinant of when I can exit the airport. (Gantt or PERT chart optimizers would say that Queue for Baggage Claim is on the critical path.)How could a designer make the airport experience better for the customer? Here are a few ideas:

  • Let me carry on more baggage. This idea would allow me to trot right through Baggage Claim without waiting for my bag. In this environment, the value of Global Entry would be tremendous. Well, nice theory; but allowing more carry-on baggage wouldn't work too well in the aggregate. The overhead bins on my flight were already stuffed to maximum capacity, and we don't need more flight delays induced by passengers who bring more stuff onboard than the cabin can physically accommodate.
  • Improve the latency of the baggage claim process. The sequence diagram shows clearly that this is where the big win is. It's easy to complain about baggage claim, because it's nearly always noticeably slower than we want it to be, and we can't see what's going on down there. Our imaginations inform us that there's all sorts of horrible waste going on.
  • Use latency hiding to mask the pain of the baggage claim process. Put TV sets in the Baggage Claim area, and tune them to something interesting instead of infinite loops of advertising. At CPH, they have a Danish hot dog stand in the baggage claim area. They also have a currency exchange office in there. Excellent latency hiding ideas if you need a snack or some DKK walkin'-around-money.

Latency hiding is a weak substitute for improving the speed of the baggage claim process. The killer app would certainly be to make Baggage Claim faster. Note, however, that just making Baggage Claim a little bit faster wouldn't make the Global Entry program any more valuable. To make Global Entry any more valuable, you'd have to make Baggage Claim fast enough that your bag would be waiting for anyone who cleared the full Passport Control queue.So, my message today: When you optimize, you must first know your goal. So many people optimize subsystems (resources) that they think are important, but optimizing subsystems is often not a path to optimizing what you really want. At the airport, I really don't give a rip about getting out of the Passport Control queue if it just means I'm going to be dumped earlier into a room where I'll have to wait until an affixed time for my baggage.Once you know what your real optimization goal is (that's Method R step 1), then the sequence diagram is often all you need to get your breakthrough insight that either helps you either (a) solve your problem or (b) understand when there's nothing further that you can really do about it.

Why We Made Method R

Twenty years ago (well, a month or so more than that), I entered the Oracle ecosystem. I went to work as a consultant for Oracle Corporation in September 1989. Before Oracle, I had been a language designer and compiler developer. I wrote code in lex, yacc, and C for a living. My responsibilities had also included improving other people's C code: making it more reliable, more portable, easier to read, easier to prove, and easier to maintain; and it was my job to teach other people in my department how to do these things themselves. I loved all of these duties.

In 1987, I decided to leave what I loved for a little while, to earn an MBA. Fortunately, at that time, it was possible to earn an MBA in a year. After a year of very difficult work, I had my degree and a new perspective on business. I interviewed with Oracle, and about a week later I had a job with a company that a month prior I had never heard of.

By the mid-1990s, circumstances and my natural gravity had matched to create a career in which I was again a software developer, optimizer, and teacher. By 1998, I was the manager of a group of 85 performance specialists called the System Performance Group (SPG). And I was the leader of the system architecture and system management consulting service line within Oracle Consulting's Global Steering Committee.

My job in the SPG role was to respond to all the system performance-related issues in the USA for Oracle's largest accounts. My job in the Global Steering Committee was to package the success of SPG so that other practices around the world could repeat it. The theory was that if a country manager in, say, Venezuela, wanted his own SPG, then he could use the financial models, budgets, hiring plans, training plans, etc. created by my steering committee group. Just add water.

But there was a problem. My own group of 85 people consisted of two very different types of people. About ten of these 85 people were spectacularly successful optimizers whom I could send anywhere with confidence that they'd thrive at either improving performance or proving that performance improvements weren't possible. The other 75 were very smart, very hard-working people who would grow into the tip of my pyramid over the course of more years, but they weren't there yet.

The problem was, how to you convert good, smart, hard-working people in the base of the SPG pyramid into people in the tip? The practice manager in Venezuela would need to know that. The answer, of course, is supposed to be the Training Plan. Optimally, the Training Plan consists of a curriculum of a few courses, a little on-the-job training, and then, presto: tip of the pyramid. Just add water.

But unfortunately that wasn't the way things worked. What I had been getting instead, within my own elite group, was a process that took many years to convert a smart, hard-working person into a reasonably reliable performance optimizer whom you could send anywhere. Worse yet, the peculiar stresses of the job—like being away from home 80% of the time, and continually visiting angry people each week, having to work for me—caused an outflow of talent that approximately equaled the inflow of people who made it to the tip of the pyramid. The tip of my pyramid never grew beyond roughly 10 people.

The problem, by definition, was the Training Plan. It just wasn't good enough. It wasn't that the instructors of Oracle's internal "tuning" courses were doing a poor job of teaching courses. And it wasn't that the course developers had done a poor job of creating courses. On the contrary, the instructors and course developers were doing excellent work. The problem was that the courses were focusing on the wrong thing. The reason that the courses weren't getting the job done was that the very subject matter that needed teaching hadn't been invented yet.

I expect that the people who write, say, the course called "Braking System Repair for Boeing 777" to have themselves invented the braking system they write about. So, the question was, who should be responsible for inventing the subject matter on how to optimize Oracle? I decided that I wanted that person to be me. I deliberated carefully and decided that my best chance of doing that the way I wanted to do it would be outside of Oracle. So in October 1999, ten years and one week after I joined the company, I left Oracle with the vision of creating a repeatable, teachable method for optimizing Oracle systems.

Ten years later, this is still the vision for my company, Method R Corporation. We exist not to make your system faster. We exist to make you faster at making all your systems faster. Our work is far from done, but here is what we have done:

  • Written white papers and other articles that explain Method R to you at no cost.
  • Written a book called Optimizing Oracle Performance, where you can learn Method R at a low cost.
  • Created a Method R course (on which the book is based), to teach you how to diagnose and repair response time problems in Oracle-based systems.
  • Spoken at hundreds of public and private events where we help people understand performance and how to manage it.
  • Provided consulting services to make people awesome at making their systems faster and more efficient.
  • Created the first response time profiling software ever for Oracle software applications, to let you analyze hundreds of megabytes of data without drudgery.
  • Created a free instrumentation library so that you can instrument the response times of Oracle-based software that you write.
  • Created software tools to help you be awesome at extracting every drop of information that your Oracle system is willing to give you about your response times.
  • Created a software tool that enables you to record the response time of every business task that runs on your system so you can effortlessly manage end-user performance.

As I said, our work is far from done. It's work that really, really matters to us, and it's work we love doing. I expect it to be a journey that will last long into the future. I hope that our journey will intersect with yours from time to time, and that you will enjoy it when it does.

Cary on Joel on SSD

Joel Spolsky's article on Solid State Disks is a great example of a type of problem my career is dedicated to helping people avoid. Here's what Joel did:

  1. He identified a task needing performance improvement: "compiling is too slow."
  2. He hypothesized that converting from spinning rust disk drives (thanks mwf) to solid state, flash hard drives would improve performance of compiling. (Note here that Joel stated that his "goal was to try spending money, which is plentiful, before [he] spent developer time, which is scarce.")
  3. So he spent some money (which is, um, plentiful) and some of his own time (which is apparently less scarce than that of his developers) replacing a couple of hard drives with SSD. If you follow his Twitter stream, you can see that he started on it 3/25 12:15p and wrote about having finished at 3/27 2:52p.
  4. He was pleased with how much faster the machines were in general, but he was disappointed that his compile times underwent no material performance improvement.

Here's where Method R could have helped. Had he profiled his compile times to see where the time was being spent, he would have known before the upgrade that SSD was not going to improve response time. Given his results, his profile for compiling must have looked like this:

100%  Not disk I/O
0% Disk I/O
---- ------------

Profiling with my Boy

We have an article online called "Can you explain Method R so even my boss could understand it?" Today I'm going to raise the stakes, because yesterday I think I explained Method R so that an eleven year-old could understand it.

Yesterday I took my 11 year-old son Alex to lunch. I talked him into eating at one of my favorite restaurants, called Mercado Juarez, over in Irving, so it was a half hour in the car together, just getting over there. It was a big day for the two of us because we were very excited about the new June 17 iPhone OS 3.0 release. I told him about some of the things I've learned about it on the Internet over the past couple of weeks. One subject in particular that we were both interested in was performance. He likes not having to wait for click results just as much as I do.

According to Apple, the new iPhone OS 3.0 software has some important code paths in it that are 3× faster. Then, upgrading to the new iPhone 3G S hardware is supposed to yield yet another 3× performance improvement for some code paths. It's what Philip Schiller talks about at 1:42:00 in the WWDC 2009 keynote video. Very exciting.

Alex of course, like many of us, wants to interpret "3× faster" as "everything I do is going to be 3× faster." As in everything that took 10 seconds yesterday will take 3 seconds tomorrow. It's a nice dream. But it's not what seeing a benchmark run 3× faster means. So we talked about it.

Happy Birthday, OFA Standard

Yesterday I finally posted to the Method R website some of the papers I wrote while I was at Oracle Corporation (1989–1999). You can now find these papers where I am.

When I was uploading my OFA Standard paper, I noticed that today—24 September 2009—is the fourteenth birthday of its publication date. So, even though the original OFA paper was around for a few years before 1995, please join me in celebrating the birthday of the final version of the official OFA Standard document.

On the Importance of Diagnosing Before Resolving

Today a reader posted a question I like at our Method R website. It's about the story I tell in the article called, "Can you explain Method R so even my boss could understand it?" The story is about sending your son on a shopping trip, and it takes him too long to complete the errand. The point is that an excellent way to fix any kind of performance problem is to profile the response time for the properly chosen task, which is the basis for Method R (both the method and the company).

Here is the profile that details where the boy's time went during his errand:

                    --Duration---
Subtask minutes % Executions
------------------ ------- ---- ----------
Talk with friends 37 62% 3
Choose item 10 17% 5
Walk to/from store 8 13% 2
Pay cashier 5 8% 1
------------------ ------- ---- ----------
Total 60 100%

I went on to describe that the big leverage in this profile is the elimination of the subtask called "Talk with friends," which will reduce response time by 62%.

The interesting question that a reader posted is this:

Not sure this is always the right approach. For example, lets imagine the son has to pick 50 items
Talk 3 times 37 minutes
Choose item 50 times 45 minutes
Walk 2 times 8 minutes
Pay 1 time 5 minutes
Working on "choose item" is maybe not the right thing to do...

Let's explore it. Here's what the profile would look like if this were to happen:

          --Duration---
Subtask minutes % Executions
------- ------- ---- ----------
Choose 45 47% 50
Talk 37 39% 3
Walk 8 8% 2
Pay 5 5% 1
------- ------- ---- ----------
Total 95 100%

The point of the inquiry is this:

The right answer in this case, too, is to begin with eliminating Talk from the profile. That's because, even though it's not ranked at the very top of the profile, Talk is completely unnecessary to the real goal (grocery shopping). It's a time-waster that simply shouldn't be in the profile. At all. But with Cary's method of addressing the profile from the top downward, you would instead focus on the "Choose" line, which is the wrong thing.

In chapters 1 through 4 of our book about Method R for Oracle, I explained the method much more thoroughly than I did in the very brief article. In my brevity, I skipped past an important point. Here's a summary of the Method R steps for diagnosing and resolving performance problems using a profile:

  1. (Diagnosis phase) For each subtask (row in the profile), visiting subtasks in order of descending duration...
    1. Can you eliminate any executions without sacrificing required function?
    2. Can you improve (reduce) individual execution latency?
  2. (Resolution phase) Choose the candidate solution with the best net value (that is, the greatest value of benefit minus cost).

Here's a narrative of executing the steps of the diagnostic phase, one at a time, upon the new profile, which—again—is this:

          --Duration---
Subtask minutes % Executions
------- ------- ---- ----------
Choose 45 47% 50
Talk 37 39% 3
Walk 8 8% 2
Pay 5 5% 1
------- ------- ---- ----------
Total 95 100%
  1. Execution elimination for the Choose subtask: If you really need all 50 items, then no, you can't eliminate any Choose executions.
  2. Latency optimization for the Choose subtask: Perhaps you could optimize the mean latency (which is .9 minutes per item). My wife does this. For example, she knows better where the items in the store are, so she spends less time searching for them. (I, on the other hand, can get lost in my own shower.) If, for example, you could reduce mean latency to, say, .8 minutes per item by giving your boy a map, then you could save (.9 – .8) × 50 = 5 minutes (5%). (Note that we don't execute the solution yet; we're just diagnosing right now.)
  3. Execution elimination for the Talk subtask: Hmm, seems to me like if your true goal is fast grocery shopping, then you don't need your boy executing any of these 3 Talk events. Proposed time savings: 37 minutes (39%).
  4. Latency optimization for the Talk subtask: Since you can eliminate all Talk calls, no need to bother thinking about latency reduction. ...Unless you're prey to some external constraint (like social advancement, say, in attempt to maximize your probability of having rich and beautiful grandchildren someday), in which case you should think about latency reduction instead of execution elimination.
  5. Execution elimination for the Walk subtask: Well, the boy has to get there, and he has to get back, so this "executions=2" figure looks optimal. (Those Oracle applications we often see that process one row per network I/O call would have 50 Walk executions, one for each Choose call.)
  6. Latency optimization for the Walk subtask: Walking takes 4 minutes each way. Driving might take less time, but then again, it might actually take even more. Will driving introduce new dependent subtasks? Warm Up? Park? De-ice? Even driving doesn't eliminate all the walking... Plus, there's not a lot of leverage in optimizing Walk, because it accounts for only 8% of total response time to begin with, so it's not worth a whole lot of bother trying to shave it down by some marginal proportion, especially since inserting a car into your life (or letting your boy drive yours) is no trivial matter.
  7. Execution elimination for the Pay subtask: The execution count on Pay is already optimized down to the legally required minimum. No real opportunity for improvement here without some kind of radical architecture change.
  8. Latency optimization for the Pay subtask: It takes 5 minutes to Pay? That seems a bit much. So you should look at the payment process. Or should you? Even if you totally eliminate Pay from the profile, it's only going to save 5% of your time. But, if every minute counts, then yes, you look at it. ...Especially if there might be an easy way to improve it. If the benefit comes at practically no cost, then you'll take it, even if the benefit is only small. So, imagine that you find out that the reason Pay was so slow is that it was executed by writing a check, which required waiting for store manager approval. Using cash or a credit/debit card might improve response time by, say, 4 minutes (4%).

Now you're done assessing the effects of (1) execution elimination and (2) latency reduction for each line in the profile. That ends the diagnostic phase of the method. The next step is the resolution phase: to determine which of these candidate solutions is the best. Given the analysis I've walked you through, I'd rank the candidate solutions in this order:

  1. Eliminate all 3 executions of Talk. That'll save 37 minutes (39%), and it's easy to implement; you don't have to buy a car, apply for a credit card, train the boy how to shop faster, or change the architecture of how shopping works. You can simply discard the "requirement" to chat, or you can specify that it be performed only during non-errand time windows.
  2. Optimize Pay latency by using cash or a card, if it's easy enough to give your boy access to cash or a card. That will save 4 minutes, which—by the way—will be a more important proportion of the total errand time after you eliminate all the Talk executions.
  3. Finally, consider optimizing Choose latency. Maybe giving your son a map of the store will help. Maybe you should print your grocery list more neatly so he can read it without having to ask for help. Maybe by simply sending him to the store more often, he'll get faster as his familiarity with the place improves.

That's it.

So the point I want to highlight is this:

I'm not saying you should stick to the top line of your profile until you've absolutely conquered it.

It is important to pass completely through your profile to construct your set of candidate solutions. Then, on a separate pass, you evaluate those candidate solutions to determine which ones you want to implement, and in what order. That first full pass is key. You have to do it for Method R to be reliable for solving any performance problem.

The Most Common Performance Problem I See

At the Percona Performance Conference in Santa Clara this week, the first question an audience member asked our panel was, "What is the most common performance problem you see in the field?"

I figured, being an Oracle guy at a MySQL conference, this might be my only chance to answer something, so I went for the mic. Here is my answer.

The most common performance problem I see is people who think there's a most-common performance problem that they should be looking for, instead of measuring to find out what their actual performance problem actually is.

It's a meta answer, but it's a meta problem. The biggest performance problems I see, and the ones I see most often, are not problems with machines or software. They're problems with people who don't have a reliable process of identifying the right thing to work on in the first place.

That's why the definition of Method R doesn't mention Oracle, or databases, or even computers. It's why Optimizing Oracle Performance spends the first 69 pages talking about red rocks and informed consent and Eli Goldratt instead of Oracle, or databases, or even computers.