Who's online

There are currently 0 users and 25 guests online.

Recent comments

June 2011

Data Science Fun with the OOW Mix Session Voting Data

Over the past few weeks Oracle Mix had opened the Oracle OpenWorld 2011 Suggest-a-Session to the general public where anyone could submit or vote on a session. One limitation of the Oracle Mix site was that it was impossible to sort the sessions by votes but that challenge was tackled by Roel Hartman with his blog post and APEX demo. After seeing the top session by votes, it was very interesting to me that around half of the top 15 sessions were all from the same author. That got me thinking…and that thinking turned into a little data hacking project that I embarked on. Now I admit it, I think data is very cool, and even cooler is extracting patterns and neat information from data.

Getting the Data

The Oracle Mix site is very “crawler friendly” — it has well defined code and tags which made extracting the data fairly painless. The basic process I used came down to this:

  1. Get the listing of all the session proposals. That was done by going to the Mix main proposal page and walking all 43 pages of submissions, scraping the direct URL to each session.
  2. Now that I had all of the session abstract URLs, grab each of those pages, all 424 of them
  3. From each session page, extract the relevant bits: Session Name, Session Author, Total Vote Count, and most importantly, who voted for this session.

I did all of that with curl, wget and some basic regex as a “version 1″ but was hoping to go back and try it again using some more sexy technology like Beautiful Soup. That will have to be continued…

The Social Network Effect

With Oracle Mix Suggest-a-Session, people generally vote for a session for one of two reasons:

  1. They are generally interested in the session topic
  2. The session author asked them to vote because of their social relationship

What I think is interesting to know is just how much of the voting is done because of #2. After all, Oracle Mix is a social networking site so there certainly is some voting for that reason. In fact, one of the session authors, Yury Velikanov from Pythian, even blogged his story of rounding up votes. The data shows us this, but more on that in just a bit…

The (Unofficial) Data

I took some time to mingle around the data and found some very interesting things. Let’s just start with a few high level points:

  • There were 424 sessions submitted from 252 different authors.
  • There were 10,125 votes from 2,447 unique voters.
  • The number of submissions ranged from 1 to 24 per author.

Here are some interesting tidbits I extracted from the data set (apologize for not making a cool visualization chart of all this – but I’ll make up for it later):

-- top 10 sessions by total votes:
| total_votes | session_author  | title                                                                          |
|         167 | tariq farooq    | Oracle RAC Interview Q/A Interactive Competition                               |
|         156 | tariq farooq    | Database Performance Tuning: Getting the best out of Oracle Enterprise Manager |
|         137 | tariq farooq    | Overview & Implementation of Clustering & High Availability with Oracle VM     |
|         130 | tariq farooq    | Migrate Your Online Oracle Database to RAC Using Streams and Data Pump         |
|         127 | tariq farooq    | 360 Degrees - Achieving High Availability for Enterprise Manager Grid Control  |
|         126 | yury velikanov  | Oracle11G SCAN: Sharing successful implementation experience                   |
|         124 | sandip nikumbha | Accelerated Interface Development Approach - Integration Framework             |
|         123 | tariq farooq    | Oracle VM: Overview, Architecture, Deployment Planning & Demo/Exercise         |
|         123 | sandip nikumbha | Remote SOA - Siebel Local Web Services Implementation                          |
|         119 | yury velikanov  | AWR Performance data mining                                                    |

-- top 10 voters (who place the most votes)
| voter_name         | votes_placed |
| arup nanda         |           53 |
| tariq farooq       |           43 |
| connie cservenyak  |           36 |
| xiaohuan xue       |           36 |
| bruce elliott      |           36 |
| peter khoury       |           35 |
| yugant patra       |           35 |
| balamohan manickam |           35 |
| suresh kuna        |           34 |
| eddie awad         |           34 |

-- top 10 voters by unique session authors (how many unique authors did they vote for?)
| name               | unique_authors |
| arup nanda         |             28 |
| paul guerin        |             24 |
| eddie awad         |             24 |
| bruce elliott      |             23 |
| xiaohuan xue       |             23 |
| connie cservenyak  |             23 |
| peter khoury       |             22 |
| wai ling ng        |             22 |
| yugant patra       |             22 |
| balamohan manickam |             22 |

-- top 10 session authors by total votes received, number of sessions, avg votes per session
| session_author      | total_votes | sessions | avg_votes_per_session |
| tariq farooq        |        1057 |        8 |              132.1250 |
| yury velikanov      |         557 |        5 |              111.4000 |
| alex gorbachev      |         429 |        6 |               71.5000 |
| sandip nikumbha     |         360 |        3 |              120.0000 |
| syed jaffar hussain |         281 |        4 |               70.2500 |
| kristina troutman   |         233 |        5 |               46.6000 |
| russell tront       |         221 |        3 |               73.6667 |
| wendy chen          |         217 |        3 |               72.3333 |
| asif momen          |         184 |        2 |               92.0000 |
| alison coombe       |         183 |        5 |               36.6000 |

Diving In Deeper

I could not help noticing that Tariq Farooq had the top 5 spots by total vote count. I would assert that is related to these two points:

  1. Tariq has some very interesting and apealing sessions
  2. Tariq has lots of friends who voted for his sessions

I have no doubt there there is some of both in the mix, but just how much influence on the votes is there from a person’s circle of friends? Or to put another way: How many voters only voted for a single session author? Or even more interesting, how many people voted for every session for a single author, and voted for no other sessions? All good questions…with answers that reside in the data!

-- number of users who voted for exactly one author
| users_voting_for_1_author |
|                       828 |

-- number of voters who voted for every session by a given author
-- and total # of votes per voter is the same # as sessions by an author
| users_who_voted_for_every_session_of_an_author |
|                                             826 |

Wow – now that interesting! Of people only voting for a single session author, just two of them did not vote for every one of that author’s sessions. That’s community for you!

Visualizing the Voting Graph

I was very interested to see what the Mix Voting Graph looked liked, so I imported the voting data into Gephi and rendered a network graph. What I was in search of was to identify the community structure of the voting community. Gephi lets you do this by partitioning the graph into modularity classes so that the communities become visible. This process is similar to how the LinkedIn InMap breaks your professional network into different communities.

Here is what the Oracle Mix voting community looks like: />

This is a great visualization of the communities and it accentuates the data from above – the voters who only voted for a single author. This can be seen by the small nodes on the outer part of the graph that have just a single edge between it and the session author’s node. Good examples of this are for Yury Velikanov and Tariq Farooq. Also clearly visible is what I’d refer to the “Pythian and friends” community centered around Alex Gorbachev and Yury Velikanov in the darker green color. There are also several other distinct communities and the color coding makes that visible.

Shouts Out

This is my first real data hacking attempt with web data and using some of the tools like Gephi for the graph analysis. One of my inspirations was Neil Kodner‘s Hadoop World 2010 Tweet Analysis, so I need to give a big shout out to Neil for that as well as his help with Gephi. Thanks Neil!

And One Last Thing

So what are people’s sessions about that were submitted? This Wordle says quite a bit.


If you wish to play on you own:


Here is another graph where the edges are weighed according to votes to an author (obviously related to number of sessions for that author). />

JL @ Turkish Oracle Users Group

See video

Recording courtesy of the people at Formspider.
The topic is about tuning a two-table join, and the video is about an hour long (so there’s a lot more that could be said on the topic, and one day I might do a half day “masterclass” on it).


Here’s a video of the presentation that I did as the keynote for the Turkish Oracle Users’ Group conference earlier on this year. Recording courtesy of the people at Formspider.

The topic is about tuning a two-table join, and the video is about an hour long (so there’s a lot more that could be  said on the topic, and one day I might do a half  day “masterclass” on it).


AWR mining – I/O wait histograms

UPDATE: July 12, 2011  I just noticed (!) that the DBA_HIST_EVENT_HISTOGRAM is only on 11g, so for 10g, it requires collecting the histogram deltas manually off of V$EVENT_HISTOGRAM, something like

set pagesize 150
col event format a25
col tm format a14
select event,
       to_char(sysdate,'J')||to_char(round(sysdate-trunc(sysdate),5)) tm,
from v$event_histogram
where event in ('log file parallel write' ,
                'db file scattered read' ,
                'db file sequential read' )
order by event;

and collecting this information every N seconds and taking the deltas. Also if you want the buckets in columns instead of rows:

    to_char(sysdate,'J')||to_char(round(sysdate-trunc(sysdate),5)) tm,
    sum (case when WAIT_TIME_MILLI=1 then WAIT_COUNT else 0 end) b1,
    sum (case when WAIT_TIME_MILLI=2 then WAIT_COUNT else 0 end) b2,
    sum (case when WAIT_TIME_MILLI=4 then WAIT_COUNT else 0 end) b3,
    sum (case when WAIT_TIME_MILLI=8 then WAIT_COUNT else 0 end) b4,
    sum (case when WAIT_TIME_MILLI=16 then WAIT_COUNT else 0 end) b5,
    sum (case when WAIT_TIME_MILLI=32 then WAIT_COUNT else 0 end) b6,
    sum (case when WAIT_TIME_MILLI=64 then WAIT_COUNT else 0 end) b7,
    sum (case when WAIT_TIME_MILLI=128 then WAIT_COUNT else 0 end) b8,
    sum (case when WAIT_TIME_MILLI=256 then WAIT_COUNT else 0 end) b9,
    sum (case when WAIT_TIME_MILLI=512 then WAIT_COUNT else 0 end) b10,
    sum (case when WAIT_TIME_MILLI=1024 then WAIT_COUNT else 0 end) b11,
    sum (case when WAIT_TIME_MILLI=2048 then WAIT_COUNT else 0 end) b12,
    sum (case when WAIT_TIME_MILLI=4096 then WAIT_COUNT else 0 end) b13,
    sum (case when WAIT_TIME_MILLI=8192 then WAIT_COUNT else 0 end) b14,
    sum (case when WAIT_TIME_MILLI=16384 then WAIT_COUNT else 0 end) b15,
    sum (case when WAIT_TIME_MILLI=32768 then WAIT_COUNT else 0 end) b16,
    sum (case when WAIT_TIME_MILLI=65536 then WAIT_COUNT else 0 end) b17,
    sum (case when WAIT_TIME_MILLI=131072 then WAIT_COUNT else 0 end) b18,
    sum (case when WAIT_TIME_MILLI=262144 then WAIT_COUNT else 0 end) b19,
    sum (case when WAIT_TIME_MILLI=524288 then WAIT_COUNT else 0 end) b20,
    sum (case when WAIT_TIME_MILLI=1048576 then WAIT_COUNT else 0 end) b21,
    sum (case when WAIT_TIME_MILLI=2097152 then WAIT_COUNT else 0 end) b22,
    sum (case when WAIT_TIME_MILLI=4194304 then WAIT_COUNT else 0 end) b23,
    sum (case when WAIT_TIME_MILLI > 4194304 then WAIT_COUNT else 0 end) b24
from v$event_histogram
where event='log file parallel write'
group by event,to_char(sysdate,'J')||to_char(round(sysdate-trunc(sysdate),5)) ;

Exploring I/O histogram values in AWR. In the AWR repository I have several databases, all reading off the same LUNs, so below I try to  take the deltas between each snapshot and sum up all the waits for each bucket in the histogram

set pagesize  50000
set feedback off
set echo off
SET markup HTML on
spool hist.html
with hist as  (
       to_char(trunc(cast(begin_interval_time as date ))+
         (ROUND ((cast(begin_interval_time as  date) -
         TRUNC (cast(begin_interval_time as date))) * 24) / 24),
        'YYYY/MM/DD HH24:MI') btime,
  from dba_hist_event_histogram  h,
       dba_hist_snapshot sn
         h.instance_number = 1
     and sn.instance_number = 1
     and h.event_name like 'db file seq%'
     and sn.snap_id=h.snap_id
     and sn.dbid=h.dbid
select  a.btime,
        sum(b.wait_count - a.wait_count)
from hist a,
     hist b
where a.dbid=b.dbid
  and a.snap_id=b.snap_id-1
  and a.wait_time_milli = b.wait_time_milli
group by a.btime, a.wait_time_milli
having  sum(b.wait_count - a.wait_count) > 0;
spool off
SET markup HTML off

With the HTML output it’s easy to read into excel.
The data can be graphed in Excel using pivot tables or pivot charts, but if I simply save the data in an excel worksheet, then I can open it up in Tableau, which in some circumstances, can be easier to use than excel

The Y axis is in log scale. I don’t find that the data speaks to me immediately (other than there are some seriously slow I/Os) but it is easy to see the major outliers.

One advantage of Tableau is the easy color coordination. Using the “cyclic” pallet, the colors come out pretty good for the I/O histograms (this would be time consuming on Excel)

Security: It’s always the silly things that get you…

I had to laugh when I read this story about Amazon Web Services. It’s posted with an attention grabbing title that implies this is an Amazon problem, but it is squarely down to user error/oversight.

Luckily I’ve not fallen into this trap yet, but I have done equally silly things in the past. That reminds me, I must go to a Pete Finnigan session next time we are at the same conference… :)



HTML with Embedded Images from PL/SQL…

Someone asked me a question about generating HTML with embedded images from PL/SQL and my answer started to spiral out of control, so I figured I’d just write it as an article.



AWR mining

Update: nice query from Jonathan Lewis on AWR:

I found this while  looking for how to eliminate reporting stats over database restarts. Apparently if using snap_id then there will be no snap_id-1 over restarts because the ids will jump more than one across restarts – have to check into this

Exploring ways to mine data in AWR.
One situation I’m looking at now is a number of database that all use the same LUNs on the back end storage, so two metrics that would be interesting to look at are IOs from these databases and the IO latencies to answer the question, “does read and/or write I/O activity correlate to increased latency on the backend storage?”

I have multiple databases in one AWR repository.

I want to create the output in a format that I can read with something like Excel, so I make the values comma delimited,
AWRs values for sysmetric are hourly, so I want to round the times to the nearest hour to make correlation easier and I want to translate the DBID into the database name to make reading the data easier. I could do something like:

                   'Physical Write Total Bytes Per Sec','write','read')||','||
                ( ROUND ((begin_time - TRUNC (begin_time)) * 24) / 24),
                       'YYYY/MM/DD HH24:MI')||','||
        from dba_hist_sysmetric_summary s,
               (select distinct dbid, db_name, host_name
                from dba_hist_database_instance) n
               ( s.metric_name= 'Physical Write Total Bytes Per Sec'
                 s.metric_name= 'Physical Read Total Bytes Per Sec'
               ) and n.dbid=s.dbid
               order by begin_time;

For the I/O latencies it take some more work as I have to compute the deltas between snapshots getting the total elasped time and the count of waits to compute the average wait over the snapshot period for each I/O wait event. I also have to watch out for database bounces, which I have inelligantly addressed by simply filtering only for waits less than or equal to 0. I could do something like

       nvl(round((time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0),3),0) avg_ms
from (
          to_char(trunc(cast(begin_interval_time as date ))+
                  (ROUND ((cast(begin_interval_time as  date)- TRUNC (cast(begin_interval_time as date))) * 24) / 24),
                       'YYYY/MM/DD HH24:MI') btime,
          Lag (e.total_waits) OVER( PARTITION BY e.event_name, e.dbid ORDER BY s.snap_id)
          total_waits count_end,
          Lag (e.time_waited_micro/1000) OVER( PARTITION BY e.event_name,e.dbid ORDER BY s.snap_id)
          time_waited_micro/1000 time_ms_end
            s.snap_id=e.snap_id and
            s.dbid=e.dbid and
            (e.event_name= 'db file sequential read'
          or e.event_name='db file scattered read' )
   order by begin_interval_time
) s,
  (select distinct dbid, db_name, host_name from dba_hist_database_instance) n
       round((time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0),3)  >= 0
order by btime;

Correlating the read/write throughput along with single and multiblock read latencies across multiple database over multiple days can pose challenges. One strategy is to boil the data down to a smaller number of metrics like total read, total write to average single block I/O. For average single block I/O, I’d want to normalize the I/O latency relative to read I/O for that database. Maybe I could get an average of latencies weighted by throughput of that database
Work in progress …


A question came up on Oracle-L recently about the difference in work done by the following two queries:



Before saying anything else, I should point out that these two queries are NOT logically equivalent unless you can guarantee that the table() operator returns a unique set of values – and Oracle doesn’t allow uniqueness to be enforced on collections.

However, the author of the question had ensured that the results from the table() operator contained no duplicates so the two versions of the query returned the same result set. The question is, why does one query do fewer buffer visits than the other – as evidenced by the results from sql_trace and tkprof.

Rule-based Join

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.01       0.01          0          0          0           0
Fetch        1      0.86       0.86          0     200047          0      115195
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.87       0.87          0     200047          0      115195

Rows     Row Source Operation
-------  ---------------------------------------------------
 115195  NESTED LOOPS  (cr=200047 pr=0 pw=0 time=6355 us)
  99704   COLLECTION ITERATOR PICKLER FETCH (cr=0 pr=0 pw=0 time=284 us)
 115195   INDEX RANGE SCAN DOM_NAME_IDX (cr=200047 pr=0 pw=0 time=3191 us)(object id 54309)

CBO with subquery

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.01       0.01          0          0          0           0
Fetch        1      0.78       0.78          0     157986          0      115195
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.80       0.80          0     157986          0      115195

Rows     Row Source Operation
-------  ---------------------------------------------------
 115195  NESTED LOOPS (cr=157986 pr=0 pw=0 time=4720 us cost=536 size=6141 card=267)
  99704   SORT UNIQUE (cr=0 pr=0 pw=0 time=417 us)
  99704    COLLECTION ITERATOR PICKLER FETCH (cr=0 pr=0 pw=0 time=259 us)
 115195   INDEX RANGE SCAN DOM_NAME_IDX (cr=157986 pr=0 pw=0 time=3353 us cost=2 size=21 card=1)(object id 54309)

Notice how the optimizer has unnested the subquery and introduced a “sort unique” – this demonstrates the optimizer’s recognition that the two queries are not logically identical. However, the number of rows from the collection is the same (99,704) and the number of rows after joining is the same (115,195) – the data sets have been rigged so that it is a fair comparison. So why does the explicit join take 200,047 buffer visits when the transformed  subquery approach take only 157,986 buffer visits.

The answer is that you can visit buffers without doing “logical I/Os” – and this benefit accrues most frequently to indexed access paths. Thanks to the way Oracle has sorted the collection before doing the join the nested loop follows the second index (dom_name_idx) in order – this increases the chance that index blocks that were used in the previous cycle of the loop will be re-used in the next cycle, which means that Oracle need not release the pins on the index blocks, but can revisit them incrementing the statitsic: “buffer is pinned count”.

I have to say that I haven’t proved that this is exactly what was happening, but when I suggested to the person who had asked the question that he re-run and check to see if the decrease in “session logical reads” was balanced by an increase in “buffer is pinned count” the answer seemed to confirm the hypothesis. [Correction (see comment #1 from Randolf): the results didn't confirm the hypothesis - but I think that' s because of the change in "pinning" that Randolf describes, so I'll have to find a way to confirm the hypothesis some other time.]


For Exadata geeks

Just letting you know that we are all (almost) done with the Expert Oracle Exadata book work and it will be published in less than a month!!!

Expect (many) new blog entries soon! :-)