Oakies Blog Aggregator

Indexes and Initrans (Blackstar)

It’s been a very tough week so, to help keep my mind off rather sad events, I thought I’d finish off one of my unpublished articles. Initrans is a physical attribute that determines the initial number of concurrent transaction entries allocated within each data block for a given table/index/cluster. Every transaction that updates a block has to acquire an […]

Yes SQL ! the conference Jan 26 & 27 at Oracle

Wow, this is big – a Yes SQL conference !

Finally ! It’s here ! Yes SQL ! the conference!

At Oracle headquarters !

Send me your questions!

Have questions about SQL? about No SQL? about Yes SQL? About how Oracle is faring in the industry with SQL? I’ll be moderating a SQL panel discussion with

  • Andy Mendelsohn (Executive Vice President for Database Server Technologies)
  • George Lumpkin (Vice President, Product Management)
  • Bryn Llewellyn (Distinguished Product Manager)
  • Steven Feuerstein (Developer Advocate)
  • Mohamed Zait (Architect), who will explain Oracle’s strategy

Send me your questions – ideally just put them in the comments below – and I’ll ask them in the panel discussion.

Is Oracle Web scale? Come and ask the panel.

The Conference

SQL is fascinating and almost every business depends on SQL to some degree. SQL permeates modern IT and is the #1 method of data access, so finally there is a conference dedicated to SQL.

Here are some of the rockstar SQL speakers at the conference:

  • Steven Feuerstein
  • Janis Griffin
  • Bryn Llewellyn
  • Andy Mendelsohn
  • Kerry Osborne
  • Tanel Poder
  • Mauro Pagano
  • Carlos Sierra

Agenda

Check out the full 3-day conference agenda.

Coordinating with BIWA

Yes SQL! is coordinating with BIWA. The first 2 days (Jan 26 and 27) are Yes SQL! and the 3rd day (Jan 28) is BIWA (the Business Intelligence, Warehousing and Analytics conference).

Full day SQL class

On Jan 28 a full-day (8 am-5 pm) SQL tuning workshop is being offered, presented by Carlos Sierra and Mauro Pagano. Seats are limited, so sign up early here.

Talk Examples

Registration

Register here.


Installing a New Enterprise Manager 13c Environment

As I patiently wait for approvals to post my Oracle Management Cloud blog posts, I thought I would just post on Enterprise Manager 13c and answer questions that have been posed to me via email and comments.

Is there anything new I need to watch for as I install EM13c?

  1. Your management repository database should be a single-tenant 12.1.0.2 database.
  2. Disable adaptive_features parameter in the database you will use for your repository.
  3. Update the /etc/security/limits.conf file to set the soft nofile limit to 30000, up from the previous common setting of 8192.
  4. The session_cached_cursors parameter should be set to 500.
  5. Even though the installer may warn you about port settings, your settings may already contain the range that was requested.  You can disregard the warning and the installation will allow you to proceed.
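For the repository database side of this checklist, points 2 and 4 translate to something like the following (a sketch only; in 12.1.0.2 the adaptive-features parameter is named optimizer_adaptive_features, so verify the exact name for your version before running anything):

```sql
-- Sketch only: run as a privileged user in the intended repository database.
-- Point 2: disable adaptive optimizer features (12.1.0.2 parameter name assumed).
ALTER SYSTEM SET optimizer_adaptive_features = FALSE SCOPE = BOTH;

-- Point 4: session cursor cache; SPFILE scope, takes effect after a restart.
ALTER SYSTEM SET session_cached_cursors = 500 SCOPE = SPFILE;
```

Point 3 is a one-line change in /etc/security/limits.conf, raising the soft `nofile` limit for the oracle user to 30000.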

eminstall_port

What kind of space allocation should I have for the software library and $AGENT_HOME?

  1. If you plan on using patch plans and simplified upgrades, your software library should have at least 100G of space.  For large EM environments, we readily recommend 250-500G for the software library.
  2. If you decide to implement an AWR Warehouse, the $AGENT_HOME is the location for the dump files of AWR data that are then pushed to the AWR Warehouse host.  These files could be anywhere between 50M and 11G.  Ensure you have the space for these, or failures in ETL processing could occur.

Is there anything in the install that frustrates people?

  1. The percentage of install completion versus the actual distribution of time. Any EM administrator is quite used to this, but for anyone unfamiliar with EM installations, the install stalls at about 46% and that can lead them to panic.

eminstall_46

^ This is when you go out for lunch, maybe a long one….

eminstall_49

^ Back from lunch, but until we get to about 52%, it could be a while.  Time to go get an after-lunch coffee! :)  Just know that from 46% until after 50%, it takes a while to get everything compiled and configured, so don’t get frustrated and think it’s hung.  It just has a lot of work to do during this time.

There are already a number of posts on what’s new and how to perform an install, so I’m going to keep this short, but hopefully valuable to those looking to install it for the first time or just looking for a few pointers.

Copyright © DBA Kevlar [Installing a New Enterprise Manager 13c Environment], All Rights Reserved. 2016.

Enterprise Manager Cloud Control 13c : Navigation and “Look and Feel”

I’ve continued to play around with Cloud Control 13c and I’m generally getting a nice vibe from it.

One of the things I really hated about Grid Control 10g and 11g was the navigation. It felt like you had to click on 50 links to get to the thing you wanted. When Cloud Control 12c came along and had a main menu it was a massive improvement. Even so, it was still a little annoying as the menu was split, with some bits on the left and some bits on the top-right.

em12c-menu

In Cloud Control 13c, these menus have been brought together into the top-right of the screen.

em13c-menu1

If the screen size is smaller, the buttons collapse to show just the icons, which saves space.

em13c-menu2

It probably sounds really trivial, but having both menus together is a really nice touch. I can’t count the number of times I’ve been fumbling around, unable to find something, only to remember it is in that blasted menu at the top-right. Now there is no excuse. :)

The job scheduler navigation is also a lot nicer. In Cloud Control 12c we had a bunch of drop-downs and a “Go” button.

em12c-jobs

In Cloud Control 13c there are tiles along the top to alter the context of the output and the tree on the left allows you to quickly flip between criteria.

em13c-jobs

It is so much quicker to get the information you want this way.

So as far as I’m concerned, Cloud Control 13c is getting a big thumbs-up from a navigation perspective!

A couple of people have asked my impression about the new look and feel. If we ignore the navigation, most of the pages are quite similar to what we had before, so there is no need to panic. Overall it has a sparser, cleaner look, which is more in keeping with the way the web is these days, so I think that’s a good thing. Anyone who has used Oracle Cloud will find the look very familiar. :)

I guess the biggest bonus of the new look and feel is it is more responsive. On some of the old pages you had a lot of sideways scrolling to do if you have a small browser window. The new look and feel deals a lot better with that. It’s not perfect, but it is better. So I’m giving the new look and feel a big thumbs-up too!

Being the bitter old man that I am, I reserve the right to change my mind and hate it all in the future. :)

Cheers

Tim…

Caveat: I use a very small subset of the functionality available from Cloud Control, so my opinion is going to be based on the bits I use a lot. It might be that other areas have been adversely affected by the new navigation or look and feel, but the bits I care about are looking good.


Enterprise Manager Cloud Control 13c : Navigation and “Look and Feel” was first posted on January 13, 2016 at 3:40 pm.
©2012 "The ORACLE-BASE Blog". Use of this feed is for personal non-commercial use only. If you are not reading this article in your feed reader, then the site is guilty of copyright infringement.

AskTom–help us to help you

On AskTom, obviously you’re going to be posting code and/or code output.  It’s probably the most vital part of the question’s test case, so that we can help you out.

So please don’t forget the “” and “” tags around your question. 

Otherwise we get this – it’s all wrapped and very hard to digest.  That’s a sure-fire way for us to go “Yup… I might come back to that one later.”

image

Graphora

I knew I had to try out Neto’s  Graphora when I saw that it was a docker container to collect Oracle performance data and send it to Graphite. Graphite has been on my radar as I see and hear about more and more companies using it to graph and monitor performance data.  Any graphical Oracle performance tool interests me and recently I’ve been spending some time investigating Docker so having everything rolled up in Graphora sounded perfect.

To run Graphora you need docker. It’s easy enough to do a yum install docker, but I decided to use a VM with docker already installed. The VM image I used was from the Delphix Express Landshark environment, and you can use this image as well if you’d like. Of the three machine images that come with the Delphix Express Landshark environment, I used the source machine image.

The basic architecture: the Graphora docker container connects to a running Oracle database, collects I/O statistics and sends them to Graphite, which runs in another docker container.

Screen Shot 2016-01-12 at 4.28.00 PM

Start Docker

# service docker start
Starting cgconfig service:                         [  OK  ]
Starting docker:                                   [  OK  ]

Pull Graphora docker container image

# docker pull netofrombrazil/graphora
latest: Pulling from netofrombrazil/graphora

Enable sys connections over SQL*Net

Create a password file so that sys can connect over SQL*Net:

$ORACLE_HOME/bin/orapwd file=$ORACLE_HOME/dbs/orapw${ORACLE_SID} password=change_on_install

Without it, the user-creation command below fails with:

ORA-01031: insufficient privileges

Create graphora user

Then run

# docker run -ti netofrombrazil/graphora --host 172.16.100.11  --port 1521 --sid XEXE --create
Enter sys password: change_on_install
Creating user graphora
Grant access for user graphora to create sessions
Grant select privilege on V$SESSION_EVENT, V$SYSSTAT, V$STATNAME for user graphora
GraphOra is ready to collect your performance data!

I’m creating the graphora user on my “source” landshark VM, which has an IP of 172.16.100.11. Its SID is XEXE.  The flag “-ti” means interactive (-i) and allocate a tty (-t).

Startup Graphite

The “-p” flags map ports in the container to ports on the host running the container.

Graphite uses these ports

  • 80: the graphite web interface
  • 2003: the carbon-cache line receiver (the standard graphite protocol)
  • 2004: the carbon-cache pickle receiver
  • 7002: the carbon-cache query port (used by the web interface)

So the command is

docker run -p 80:80 -p 2003:2003 -p 2004:2004 -p 7002:7002 -d nickstenning/graphite

Now you can access graphite at the URL of the machine. It looks like this:

Screen Shot 2016-01-12 at 1.57.56 PM

Startup Graphora

Now the big moment – start up graphora, sending performance data from my XEXE database to graphite:

# docker run -ti netofrombrazil/graphora \
              --host 172.16.100.11 \
              --port 1521 \
              --sid XEXE \
              --interval 10 \
              --graphite graphs-db \
              --graph-port 2003
 phyReads: 0 phyWrites: 0 dbfsr: 0.00 lfpw: 0.00
 Could not connect to graphs-db on port 2003.

Whoops, what went wrong?

What went wrong

First thing to do was see if graphite was actually working correctly. I decided to send data to graphite with netcat

export PORT=2003
export SERVER=172.16.100.11
echo "local.random.diceroll 4 `date +%s`" | nc  ${SERVER} ${PORT}

And yes, this works. I can see a new “local” category of data in the graphite web interface. Graphite is working correctly, so the problem must be in how I called graphora.
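As an aside, the same smoke test can be scripted instead of using netcat; Graphite’s plaintext protocol is just one line of `path value timestamp` sent to the carbon-cache line receiver (port 2003 here). A minimal sketch, with illustrative helper names of my own:

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Render one line of Graphite's plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n".encode()

def send_metric(server, port, path, value):
    """Send a single metric to a carbon-cache line receiver (sketch)."""
    with socket.create_connection((server, port), timeout=5) as sock:
        sock.sendall(format_metric(path, value))

# Equivalent of the nc dice-roll test:
# send_metric("172.16.100.11", 2003, "local.random.diceroll", 4)
```

Anything that can open a TCP socket can feed Graphite this way, which is part of its appeal.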

Having never used graphite nor graphora before, I wasn’t sure what all the command line options were, and I got one of the arguments wrong. But how could I find out what the command line arguments should be? They aren’t outlined on Neto’s blog. They aren’t on the docker download page https://hub.docker.com/r/netofrombrazil/graphora/

So now I dug around to try to find out what is in Neto’s package. I discovered the command

docker diff [name of container]

which shows the new and modified files in the container.

# docker diff graphora
C /opt
C /opt/graphOra
C /opt/graphOra/graphOra
C /opt/graphOra/lib
C /opt/graphOra/lib/database.pyc

There are not that many files in the container. But how do I see the files that are inside it? To get to them, I have to exec a bash shell into the container (or know where the actual mapped files are):

docker exec -i -t graphora bash

This will give me a shell in the container, and I can now vi the files. I then vi the graphOra file

# vi /opt/graphOra/graphOra

and find the command line arguments

def cli_args():
  parser.add_argument('--host', required=True, help='Oracle Database server')
  parser.add_argument('--port', required=True, help='Database listener port')
  parser.add_argument('--sid', required=True, help='Database SID')
  parser.add_argument('--user', default='graphora', help='User to connect into the database')
  parser.add_argument('--password', default='WeLoveThisCompany', help='user db password')
  group = parser.add_mutually_exclusive_group()
  group.add_argument('--create', action='store_true', help='Setup graphOra to monitor the database instance')
  group.add_argument('--interval', default=10, type=int, help='Interval time in seconds between each data collection')
  parser.add_argument('--graphite', default=' ', help='Specify IP or Hostname of Graphite Server')
  parser.add_argument('--graph-port', type=int, help='Specify Graphite Server port')
  parser.add_argument('--print-stats', default='true', choices=['true','false'], help='Print stats')
  parser.add_argument('--project', default='oracle', help='Specify a string to identify your data in Grafana')

Getting it right

My problem was that I thought “–graphite” was possibly the default name of the graphite database. I was thinking that, since graphite collects data, maybe it has its own database with a default name graphite-db, which I saw in Neto’s command line example. Well, it turns out, as we can see above, that “–graphite” is the IP or host name of the host running graphite. In my case everything – the Oracle XE database, graphite and graphora – is running on my one machine, 172.16.100.11. With that change it now works:

# docker run -ti netofrombrazil/graphora \ 
              --host 172.16.100.11 \
              --port 1521 \
              --sid XEXE \
              --interval 10 \
              --graphite 172.16.100.11 \ 
              --graph-port 2003

I can then plot the values. On the top left of the Graphite interface I can click on “netapp” and continue expanding until I get to the I/O metrics.

Screen Shot 2016-01-12 at 2.08.05 PM

I can then click on any of the metrics to see the graph.

If I click on the “dashboard” menu item in the top right, I can click on the metrics and get a graph for each metric, building a dashboard.

Screen Shot 2016-01-12 at 1.12.13 PM

Choose metrics by clicking on the names in the top right. Each time you choose a name, the next level of names is shown, until you choose the actual metric.

The UI might be a little more intuitive in the tree mode. To change to tree mode, select “Dashboard” then “Configure UI”

Screen Shot 2016-01-12 at 2.16.57 PM

Then select “Tree”

Screen Shot 2016-01-12 at 2.17.17 PM

Now the interface looks like

Screen Shot 2016-01-12 at 1.45.11 PM

Grafana

Looking at Neto’s blog, his graph UI looked much different and nicer than the graphite graphs. I poked around and it looks like his graph was a grafana graph. Grafana works with graphite and there is a docker version of it, so I started it up:

docker run -i -p 3000:3000 grafana/grafana

Now I can access grafana on my machine at port 3000. Go to the URL  your_machine:3000 and you will see a login page

Screen Shot 2016-01-12 at 2.31.05 PM

login with admin/admin

Screen Shot 2016-01-12 at 2.32.01 PM

If you don’t see the menus on the left then click the Grafana icon in the top right and the menus should appear. First step is to add the graphite data source. Click on “Data Source” on the right

Screen Shot 2016-01-12 at 2.33.37 PM

Then click “Add new” in the top right.

Screen Shot 2016-01-12 at 2.33.47 PM

The main thing is to add the URL for graphite, which in my case is simply the IP of my machine, 172.16.100.11

Name the data source “graphora” in this example.

Now click on “home” on the top menu and then “+New” on the bottom buttons

Screen Shot 2016-01-12 at 2.36.52 PM

Click on the green line in the top middle and choose “Add Panel -> Graph”

Screen Shot 2016-01-12 at 2.38.47 PM

Set the data source in the bottom left for the dashboard to be “graphora” data

Screen Shot 2016-01-12 at 2.39.51 PM

Then click “select metric” in the middle of the screen just below the “metric” tab label.

When you click “select metric” you will see the metrics collected by graphora which all start out with the prefix “netapp”

Screen Shot 2016-01-12 at 2.40.02 PM

You have to choose the fully qualified name, like

netapp perf oracle 172 16 100 11 XEXE dbfsr_latency

but for the other metrics, you can choose duplicate and then change the value of the last field to go quicker

Screen Shot 2016-01-12 at 2.45.42 PM

With all 5 metrics chosen, my dashboard looks like this:

Screen Shot 2016-01-12 at 1.51.54 PM

Summary and Reflections

The whole combination of technologies is interesting – Docker, Oracle, Graphora,  Graphite, Grafana etc.

In this particular case with Graphora, I wonder if it would just be easier to distribute the 2 files necessary to run Graphora instead of a Docker container. I don’t know. Docker is cool. Docker makes a lot of sense for things that have many files, and for things that I want to run in multiple concurrent versions without having the different versions interfere, but is it overkill for a couple of python files? It’s nice that with docker one doesn’t need any download or install instructions. It’s as simple as “docker run container”.

When it comes to data collection and graphing, it’s attractive that these packages take care of so much of the scripting and plumbing.  For example, when it comes to collecting data, I tend to either stuff it in flat files or stuff it in Oracle. Neither of these approaches is great. Flat files are inefficient and require a lot of coding. Oracle is huge and makes deployment a bear if one depends on it. With graphite it’s great that it comes with aggregation and purging routines.

When it comes to graphing, I’ve been completely happy with Highcharts, though again, if using Highcharts I have to create my own HTML files with javascript etc., whereas with graphite and grafana it’s just point and click. On the other hand, a lot of the optimizations, aggregation, graph parent-child relationships, and custom displays seem difficult to impossible with these packages.

All in all, quite intriguing. I guess if I want to get going fast, then graphite might be a good way to display a bunch of statistics in line graphs. On the other hand, the problem that I see missed routinely in the industry is that creating performance dashboards is a lot more than pasting up a bunch of line graphs. Data visualization, in order to be powerful, needs a lot of intelligence in which metrics are chosen, and how they are aggregated, correlated and displayed. Often the display requires customization and new approaches to displaying data visually. What I want to see more of are pre-set dashboards that intelligently lay out and correlate data for particular use cases. Use cases in the realm of performance monitoring of IT systems tend to cover the same types of situations – where are you spending time, how much CPU, I/O, concurrency contention, etc.

A cool discovery with Docker: to get a list of files created or modified by a docker container, run

docker diff container_name

Then I can get into the container with exec bash to look at or modify these files

docker exec -i -t graphora bash

If I modify the files, I can then restart the container

docker restart graphora

To see the docker containers that exist and/or are running, use the command

docker ps -a

Reference

Graphite uses the Whisper database by default

Graphite vs Grafana as well as storage database options such as InfluxDB and Cyanite (Cassandra)

Introducing stapflame, extended stack profiling using systemtap, perf and flame graphs

There’s been a lot of work in the area of profiling. One of the things I have recently fallen in love with is Brendan Gregg’s flamegraphs. I work mainly on Linux, which means I use perf for generating stack traces. Luca Canali put a lot of effort in generating extended stack profiling methods, including kernel (only) stack traces and CPU state, reading the wait interface via direct SGA reading and kernel stack traces and getting userspace stack traces using libunwind and ptrace plus kernel stack and CPU state. I was inspired by the last method, but wanted more information, like process CPU state including runqueue time.

I started playing around with systemtap, and was able to read a process’ CPU state including run queue time. This involves using the kernel tapset ‘scheduler’, which unfortunately needs the kernel debug info packages (kernel-uek-debuginfo and kernel-uek-debuginfo-common, available via oss.oracle.com). It is not hard to include wait interface information; this is work Luca and I collaborated on in the past. I created a systemtap script called cpu_and_wait_profile.stap, which shows the oracle database state transitions between on cpu and in a wait, as well as kernel CPU state information. This is how it should be executed and what it outputs:

# stap -x 6641 cpu_and_wait_profile.stap
Begin.
w     - 388 (no begin)
c     1    	tot:         334	on:         333	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     2 384	tot:           5	on:           5	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
c     3    	tot:         644	on:         644	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     4 212	tot:          58	on:          41	off:          17	q:           5	ti:           9	tu:           0	#slices:    1
c     5    	tot:         371	on:         371	off:           0	q:           0	ti:           0	tu:           0	#slices:    0
w     6 212	tot:         146	on:          58	off:          88	q:          14	ti:          69	tu:           0	#slices:    1
c     7    	tot:        1787	on:        1745	off:          42	q:          37	ti:           0	tu:           0	#slices:    2
w     8 212	tot:         265	on:          30	off:         234	q:          12	ti:         218	tu:           0	#slices:    1

The first column indicates if the process is inside an Oracle wait event (w), or is considered running on cpu (c) by the database.
The second column is a serial number. The third column is the wait event number if the process is considered inside a wait, or empty if on CPU.
The column ‘tot’ is the total time (in microseconds) spent on cpu or inside a wait event.
The column ‘on’ is the time spent truly running on CPU. Obviously, ‘off’ is all the time not spent running on the CPU.
The columns ‘q’, ‘ti’ and ‘tu’ are off CPU states. ‘q’ is time spent in the CPU runqueue. This is not a kernel state; a process gets the state ‘TASK_RUNNING’ to indicate it is willing to run, after which it’s the task of the scheduler to manage willing-to-run processes and get them onto a CPU. ‘ti’ is a kernel state, which means ‘TASK_INTERRUPTIBLE’. This is a state in which the process is taken off the CPU, because it is waiting for something to complete. ‘Something’ means a disk IO, a timer to expire, etc. ‘tu’ means ‘TASK_UNINTERRUPTIBLE’, which is used if a process should only continue when a specific condition is met, and reacting to signals would be problematic.
The last column ‘#slices’ is the number of times the process has gotten on cpu.
If you look at the example output above, you see that the process started running, and remained running until sequence number 4 (the sequence number is the second column). Sequence number 4 is an Oracle wait event, number 212 (direct path read). The earlier wait event number 384 was passed without actually waiting; the total time is 5us, and on-cpu time was 5us too (!). The total time spent in the wait event in sequence #4 is 58us, of which 41us was spent on cpu, and 17us off cpu. The off-cpu time is composed of 5us run queue time (q) and 9us ‘TASK_INTERRUPTIBLE’ time, of which the total is 14us, which leaves 3us off cpu/unaccounted for. This is time taken by the state transitions and context switches. The actual sequence of events of the CPU state is: TASK_RUNNING (on cpu), then TASK_INTERRUPTIBLE is entered, which is actually waiting for IOs in this case (wait event ‘direct path read’, remember?). The ‘TASK_INTERRUPTIBLE’ state means the process is stopped from processing by the kernel (taken off cpu), which is logical, because it means the process is deliberately waiting for something before it can continue. Once the condition is met (IO(s) ready in this case), the process can continue. To continue, the process state is set to ‘TASK_RUNNING’, and it is put on a runqueue. This means there is no explicit process state ‘in run queue’. This state (state set to ‘TASK_RUNNING’ but not running on CPU yet) is shown with ‘q’. Once the process has enough priority, the scheduler switches the process onto the CPU again.
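To make the accounting concrete, the sequence #4 breakdown above can be checked with a few lines (the column names mirror the script’s output; the remainder is the state-transition/context-switch overhead just mentioned):

```python
def breakdown(tot, on, q, ti, tu):
    """Split an interval (microseconds) into on-CPU and off-CPU parts.

    off is everything not spent on CPU; the part of off not explained by
    runqueue (q), TASK_INTERRUPTIBLE (ti) or TASK_UNINTERRUPTIBLE (tu)
    is state-transition and context-switch overhead.
    """
    off = tot - on
    unaccounted = off - (q + ti + tu)
    return off, unaccounted

# Sequence #4 from the output above: tot=58, on=41, q=5, ti=9, tu=0.
off, slack = breakdown(tot=58, on=41, q=5, ti=9, tu=0)
print(off, slack)  # → 17 3
```

That 3us of slack is exactly the “unaccounted for” time described in the text.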

Okay, so at this point we have a (systemtap) script that can very precisely account for the time spent by a process. Wouldn’t it be great if we could see a flame graph per sequence number? I spent a great deal of time trying to figure out a way to combine the two. Then I learned about the ‘-T’ switch of perf record:

    -T, --timestamp       Sample timestamps

Great!! The way this works, is that perf includes ‘timestamps’ during recording (perf record), which are printed when the perf recording is externalised with the ‘perf script’ command:

oracle_92213_fv 92213 34075.900988: cycles:
        ffffffff810483da native_write_msr_safe ([kernel.kallsyms])
        ffffffff8102bf91 intel_pmu_enable_all ([kernel.kallsyms])
        ffffffff810263cc x86_pmu_enable ([kernel.kallsyms])
        ffffffff811221db perf_pmu_enable ([kernel.kallsyms])
        ffffffff81124d09 perf_event_context_sched_in ([kernel.kallsyms])
        ffffffff811252c5 __perf_event_task_sched_in ([kernel.kallsyms])
        ffffffff810962ce finish_task_switch ([kernel.kallsyms])
        ffffffff8159f81d __schedule ([kernel.kallsyms])
        ffffffff8159fec9 schedule ([kernel.kallsyms])
        ffffffff8119e56b pipe_wait ([kernel.kallsyms])
        ffffffff8119f030 pipe_read ([kernel.kallsyms])
        ffffffff81195c37 do_aio_read ([kernel.kallsyms])
        ...etc...

‘34075.900988’ is the timestamp. However, what is this number? I searched for quite some time, and there is no clear description to be found. It clearly is not epoch time.

Some experimentation showed that – apparently – the number is seconds since startup with microsecond granularity. Further experimentation using systemtap showed that exactly the same number can be fetched with the systemtap local_clock_us() function. This makes it possible to link perf stack traces with systemtap output! I created a script (run_stapflame.sh) that runs perf record -g and systemtap at the same time, then combines the information from both tools (meaning the systemtap runtime data is pushed into the stack trace information), after which flame graphs are created.
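For context, the flamegraph.pl tool consumes “folded” stacks: one line per unique stack, frames joined by semicolons, followed by a sample count. A minimal sketch of that folding step (conceptually what stackcollapse-perf.pl does, here applied to already-parsed stacks rather than raw perf script output):

```python
from collections import Counter

def fold_stacks(stacks):
    """Collapse a list of stacks (each a root-first list of frame names)
    into the folded format flamegraph.pl expects: 'frame1;frame2;... count'."""
    counts = Counter(";".join(frames) for frames in stacks)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

stacks = [
    ["schedule", "pipe_wait", "pipe_read"],
    ["schedule", "pipe_wait", "pipe_read"],
    ["x86_pmu_enable", "perf_pmu_enable"],
]
for line in fold_stacks(stacks):
    print(line)
```

run_stapflame.sh effectively enriches these folded lines with the systemtap state data before handing them to flamegraph.pl.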

When a process is not running, there will be no perf stack traces, because there is no process for perf to take the stack trace from. So only when running on CPU (the TASK_RUNNING state, excluding run queue time) should there be perf data. Also, the systemtap times are accurately measured, but the stack traces of perf are sampled. This means perf is missing data (by definition: sampling means you look something up at a certain interval, which means you are not looking between the intervals). What I have done is extrapolate the perf samples found for an Oracle CPU or wait interval relative to the time in the measured interval. This means that if the time in the interval is 100us, and I get two collapsed stack traces with 1 and 3 samples, the extrapolated times will be; 1: 1/(1+3)*100us=25us, and 3: 3/(1+3)*100us=75us. This is not scientific, but it is meant to give an idea. Also, in order to show something useful in flame graphs, all the data needs to be based on the same data type, so I need to transform the number of stack traces to time.
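The extrapolation described above is straightforward; with the worked numbers from the text (a 100us interval, collapsed stacks with 1 and 3 samples):

```python
def extrapolate(sample_counts, interval_us):
    """Scale per-stack perf sample counts to time, proportionally to the
    accurately measured systemtap interval (an estimate, not exact)."""
    total = sum(sample_counts)
    return [n / total * interval_us for n in sample_counts]

print(extrapolate([1, 3], 100))  # → [25.0, 75.0]
```

As the author notes, this is an approximation: it assumes the samples are representative of how time was actually spent within the interval.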

I created a github project stapflame for my files.

First, you need to install the debuginfo kernel packages, as has been mentioned in this blogpost.
Then, you need to execute eventsname.sql in order to generate eventsname.sed, which is used to translate wait event numbers to wait event names. Wait event numbers change between Oracle database versions, and could potentially change after a PSU is applied. eventsname.sed must be in the same directory as the run_stapflame.sh script.
Then, you need to fetch stackcollapse-perf.pl and flamegraph.pl from Brendan Gregg’s github flamegraph repository. These need to be in the same directory as the run_stapflame.sh script too, and have the execute bit set.

Once the requirements are met, you can use the run_stapflame.sh script:

# ./run_stapflame.sh 123

The first argument must be the PID of an existing Oracle foreground process.
This will compile and run the systemtap script. Once both systemtap and perf are running, the following message is displayed:

profiling started, press enter to stop

Now execute what you want to be profiled in the Oracle session. Once you are done, return to the session where you ran run_stapflame.sh, and press enter to stop the profiling.
Depending on how much systemtap and perf information was generated, the script will process for some time (it is coded in bash, is not highly optimised, and in fact is quite inefficient with larger sets(!)).
There will be some messages from perf indicating how many samples it captured:

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.213 MB perf.data (~9286 samples) ]

And systemtap complaining about missing symbols for the kernel module it just created (for itself :-)):

No kallsyms or vmlinux with build-id 532f482ae316500802639d35de6c302fdb685afa was found
[stap_b6a486219fd483305d4991b483feb397_16565] with build id 532f482ae316500802639d35de6c302fdb685afa not found, continuing without symbols

The stapflames in the example mentioned here are of a simple ‘select count(*) from table’ in oracle, resulting in a direct path read.

This is the resulting flame graph of my original idea. In short, it does not look very useful in this case, and might never be useful once you have more than approximately 20-100 cpu-wait combinations.

However, other flame graphs are more useful; look at this flame graph, which is about process CPU state (stack traces grouped by process CPU state, which means that the Oracle events (ON CPU and the wait events) can be found for every CPU state). It’s nice to see that the kernel state TASK_RUNNING (which is on CPU; RUNQUEUE is specified independently) is 46.86% of the total time. Of this time, almost all (40.35% of the total time) is in the Oracle state ‘ON CPU’. After TASK_RUNNING, TASK_INTERRUPTIBLE time is 49.56% of the total time. Almost all time in TASK_INTERRUPTIBLE has gone to the wait event direct path read. There is not a great deal of time spent in the run queue (1.74% of the total time). If you look up through the stack traces in TASK_RUNNING and ON CPU, you see that of the 40.35% of running on CPU, 15.33% is spent executing the io_submit function. This means 15.33/40.35*100=37.99% of the time on CPU is spent on submitting IO requests. Also in TASK_RUNNING and ON CPU, 7.57% is spent in the function sxorchk, which is block checksumming (db_block_checksum; set to ‘typical’ by default).

Another interesting view is this flame graph; this one covers only the wait event ‘direct path read’. As you can see, most of the wait event is not spent running on CPU. That is quite logical for a wait event :). 96.03% of the time spent in the wait event ‘direct path read’ is in the TASK_INTERRUPTIBLE state. 2.68% of the time in the wait event ‘direct path read’ is spent in the TASK_RUNNING on-CPU state. Despite being inside wait event time, it’s quite logical to have some time spent running on CPU, because you need to be on CPU to reap IO requests.

I love to hear comments, additions, corrections or any other feedback!

Credits and other references:
# Brendan Gregg and his work on FlameGraphs.
# Luca Canali and his work on stack profiling, which inspired me to create this tool, and proofreading.
# Tanel Poder and his work on Oracle internals in general.
# Klaas-Jan Jongsma for proofreading.

Footnotes:
– The systemtap script contains the full path to the Oracle database executable in the userspace function probes. This obviously must be changed to reflect the path to the Oracle executable of the process the systemtap script is profiling. Alternatively, it can just list the executable name (“oracle”), but then the executable must be in the $PATH.

Tagged: dtrace, flamegraph, oracle, performance, profile, stack profiling, stack trace, stap, systemtap

David Bowie 1947-2016. My Memories.

In mid-April 1979, as a nerdy 13 year old, I sat in my bedroom in Sale, North England listening to the radio when a song called “Boys Keep Swinging” came on by a singer called David Bowie, whom I had never heard of before. I instantly loved it and taped it next time it came on the radio via my […]

Subquery Effects

Towards the end of last year I used a query with a couple of “constant” subqueries as a focal point for a blog note on reading parallel execution plans. One of the comments on that note raised a question about cardinality estimates and, coincidentally, I received an email about the cost calculations for a similar query a few days later.

Unfortunately there are all sorts of anomalies, special cases, and changes that show up across versions when subqueries come into play – it’s only in recent versions of 11.2, for example, that a very simple example I’ve got of three equivalent statements that produce the same execution plan report the same costs and cardinality. (The queries are:  table with IN subquery, table with EXISTS subquery, table joined to “manually unnested” subquery – the three plans take the unnested subquery shape.)

I’m just going to pick out one particular anomaly, which is a costing error with multiple subqueries when “OR-ed”. Here’s my sample data set:


create table t1
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 20000
;


create table t2
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 25000
;

create table t3
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 30000
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t1',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t2',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t3',
                method_opt       => 'for all columns size 1'
        );
end;
/

The three tables are slightly different sizes so that it will be easy to see the different costs of tablescans, and there are no indexes, so everything I do in the queries will be tablescans. Here are six queries I’m going to test – they all scan t1, with “constant” subqueries against t2 and/or t3. The first pair is just to show you the basic cost of the query with a single subquery, the second pair shows you the default action with two subqueries in two different orders, and the final pair shows you what happens with two subqueries when you block subquery pushing.


select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     t1.n2 > (select avg(t2.n2) from t2)
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     t1.n3 > (select avg(t3.n3) from t3)
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n2 > (select avg(t2.n2) from t2)
         or t1.n3 > (select avg(t3.n3) from t3)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n3 > (select avg(t3.n3) from t3)
         or t1.n2 > (select avg(t2.n2) from t2)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
         or t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
         or t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
        )
;

Here are the first two plans, pulled from memory (which you might have guessed thanks to the “disappearing subquery predicate” in the predicate section). These examples came from 12.1.0.2, but the same happens in 11.2.0.4:


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   111 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   500 |  5000 |    49   (3)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND "T1"."N2">))

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   123 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   500 |  5000 |    49   (3)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND "T1"."N3">))

As you can see, the cost of the query is the cost of the t1 tablescan plus the cost of running the t2 or t3 subquery once: 111 = 49 + 62, and 123 = 49 + 74.

(As a general guideline, recent versions of the optimizer tend to allow for subqueries by including “cost of subquery” * “number of times the optimizer thinks it will execute” – in this case the optimizer knows that the subquery will run exactly once).
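That guideline can be expressed as a tiny arithmetic sketch (Python; the helper name is mine, not anything from the optimizer, and the costs are taken from the plans above):

```python
def query_cost(driving_cost, subqueries):
    """Approximate costing: the driving tablescan cost plus, for each
    subquery, its cost multiplied by the expected number of executions."""
    return driving_cost + sum(cost * execs for cost, execs in subqueries)

# t1 tablescan costs 49; the t2 subquery costs 62 and the t3 subquery 74,
# and each "constant" subquery is expected to run exactly once.
print(query_cost(49, [(62, 1)]))  # → 111
print(query_cost(49, [(74, 1)]))  # → 123
```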

But what happens when we test the query that applies BOTH subqueries to the tablescan ?


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |    50 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   975 | 14625 |    50   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND ("T1"."N2"> OR "T1"."N3">)))


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |    50 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   975 | 14625 |    50   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND ("T1"."N3"> OR "T1"."N2">)))

The cost of the query in both cases is just the cost of the tablescan of t1 – the subqueries are, apparently, free. You can check from the predicate section, by the way, that the subqueries are applied in the order they appear in the original statement.

Does anything change if the subqueries are not pushed ?


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   111 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   FILTER             |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   4 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   5 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
|   6 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N2"> OR "T1"."N3">))
   3 - filter("T1"."N1">10000)

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   124 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   FILTER             |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   4 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   5 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   6 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N3"> OR "T1"."N2">))
   3 - filter("T1"."N1">10000)

The two plans have different costs – and the cost is the cost of the tablescan of t1 plus the cost of just the first subquery in the filter predicate list.

The non-pushed subqueries show up another anomaly: you’ll notice that the t1 tablescan reports 10,001 rows cardinality, but the FILTER operation doesn’t have an associated cardinality so we can’t see how many rows the optimizer thinks will survive the subqueries. So let’s run a query that allows us to see the surviving row estimate:


select
        max(n1)
from
        (
        select
                /*+ no_eliminate_oby */
                t1.n1
        from
                t1
        where
                t1.n1 > 10000
        and     (
                   t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
                or t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
                )
        order by
                n1
        )
;

-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |       |       |   126 (100)|          |
|   1 |  SORT AGGREGATE        |      |     1 |    13 |            |          |
|   2 |   VIEW                 |      | 10001 |   126K|   126   (5)| 00:00:01 |
|   3 |    SORT ORDER BY       |      | 10001 |   146K|   126   (5)| 00:00:01 |
|*  4 |     FILTER             |      |       |       |            |          |
|*  5 |      TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   6 |      SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |       TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   8 |      SORT AGGREGATE    |      |     1 |     5 |            |          |
|   9 |       TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("T1"."N3"> OR "T1"."N2">))
   5 - filter("T1"."N1">10000)

As you can see, the SORT ORDER BY operation thinks it’s going to handle 10,001 rows – it looks as if the optimizer arithmetic hasn’t applied the usual subquery guess of 5% for the two subqueries. (When the subqueries were automatically pushed you saw a cardinality of 975 – which is 5% for subquery t2 plus (due to OR) 5% for subquery t3 minus 5% of 5% (=25) for the overlap – this is the standard OR arithmetic.)
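The standard OR arithmetic mentioned above can be reproduced directly (a sketch; 5% is the optimizer's fixed guess for each subquery filter):

```python
# Combined selectivity of (A or B): s(A) + s(B) - s(A)*s(B)
subquery_guess = 0.05  # fixed 5% guess per subquery
rows_in = 10001        # rows surviving t1.n1 > 10000

combined = subquery_guess + subquery_guess - subquery_guess * subquery_guess
print(round(combined, 4))         # → 0.0975
print(round(rows_in * combined))  # → 975
```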

tl;dr

Although the optimizer code has been enhanced in many places for dealing with subquery estimates, there are still some odd errors and inconsistencies that you need to be aware of. The examples I’ve shown may not be particularly significant in terms of what they do, but the pattern is one that you may recognise in more complex queries.

 

Reference script: subq_cost_anomaly_2.sql

 

MobaXterm 8.5

MobaXterm 8.5 was released at the end of last year. The download and changelog can be found here.

If you use a Windows desktop and you ever need to SSH to a server or use X emulation, you need this bit of software in your life. There is a free version, or you can subscribe to support the development.

Give it a go. You won’t regret it.

Cheers

Tim…


MobaXterm 8.5 was first posted on January 11, 2016 at 9:28 am.
©2012 "The ORACLE-BASE Blog". Use of this feed is for personal non-commercial use only. If you are not reading this article in your feed reader, then the site is guilty of copyright infringement.