Oakies Blog Aggregator

Collaborate ’15

It’s time for another phenomenal Collaborate conference next week, made even more fabulous by the fact that I’ll be there for the first time! :) For some reason, Collaborate is one of the few big user group conferences I haven’t made it to before, so when my boss asked me to go and present at it I leapt at the chance.

Originally, I had one presentation timeslot, but then Alex Gorbachev of Pythian / Oak Table fame noticed I was attending and asked me to do a couple of slots at Oak Table World on the Wednesday. Then the organizers had a drop-out and asked Seth Miller and me to fill in, so my dance card is starting to look a bit fuller! The dates and times for when I will be presenting currently stand at:

  • Enterprise Manager Cloud Setup and PDBaaS – Seth and me. Banyan B => Tue, April 14 12:15 PM – 12:45 PM
  • Knowledge Sharing – Why Do It? Mandalay Ballroom K => Wed, April 15 10:30 – 11:10 AM
  • SnapClone++. Mandalay Ballroom K => Wed, April 15, sometime between 2:00 PM and 3:00 PM
  • Data Clone and Refresh Made Easy with Enterprise Manager 12c Snap Clone. Palm D => Thu, April 16, 2015 11:00 AM – 12:00 PM

The first session is also part of the virtual conference. Seth will spend 15 minutes on a quick overview of the setup required to support cloud in EM, and then I’ll cover PDBaaS for the remaining time.

The Knowledge Sharing session is a different sort of talk for me. For a start, it’s non-technical. Part of it will cover the sorts of knowledge sharing I’ve been doing over the past 20+ years, and then I’ll branch into the pros and cons of a few different types of knowledge sharing. It should be an interesting experience!

The SnapClone++ session is a lightning talk. As Alex describes them, “Lightning Talks are 10-minute presentations done in a rapid-fire fashion. They are always a huge success—you’ve got to be there! They may be technical, motivational, or inspirational, but regardless they are always cool speeches. The sequence of the talks may change, but everything will be presented within the hour.”

The final one is back to the normal sort of presentation I’ve been doing for many years. It will cover why you would want to use Snap Clone, and include a demo of how the functionality actually works.

Apart from all that, I’ll be spending some time on the Demo Grounds, as well as venturing into some of the EM SIG meetings. Hopefully at some time you can catch up and have a chat, so come on by! As I’m still travelling in a wheelchair, it should be easy to spot me. :)

Counting

There’s a live example on OTN at the moment of an interesting class of problem that can require some imaginative thinking. It revolves around a design that uses a row in one table to hold the low and high values for a range of values in another table. The problem is then simply to count the number of rows in the second table that fall into the range given by the first table. There’s an obvious query you can write (a join with inequality) but if you have to join each row in the first table to several million rows in the second table, then aggregate to count them, that’s an expensive strategy.  Here’s the query (with numbers of rows involved) that showed up on OTN; it’s an insert statement, and the problem is that it takes 7 hours to insert 37,600 rows:


    INSERT INTO SA_REPORT_DATA
    (REPORT_ID, CUTOFF_DATE, COL_1, COL_2, COL_3)
    (
    SELECT 'ISRP-734', to_date('&DateTo', 'YYYY-MM-DD'),
           SNE.ID AS HLR
    ,      SNR.FROM_NUMBER||' - '||SNR.TO_NUMBER AS NUMBER_RANGE
    ,      COUNT(M.MSISDN) AS AVAILABLE_MSISDNS
    FROM
           SA_NUMBER_RANGES SNR          -- 10,000 rows
    ,      SA_SERVICE_SYSTEMS SSS        --  1,643 rows
    ,      SA_NETWORK_ELEMENTS SNE       --    200 rows
    ,      SA_MSISDNS M                  --    72M rows
    WHERE
           SSS.SEQ = SNR.SRVSYS_SEQ
    AND    SSS.SYSTYP_ID = 'OMC HLR'
    AND    SNE.SEQ = SSS.NE_SEQ
    AND    SNR.ID_TYPE = 'M'
    AND    M.MSISDN  >= SNR.FROM_NUMBER
    AND    M.MSISDN  <= SNR.TO_NUMBER
    AND    M.STATE  = 'AVL'
    GROUP BY
           SNE.ID,SNR.FROM_NUMBER||' - '||SNR.TO_NUMBER
    )  

The feature here is that we are counting ranges of MSISDNs: we take 10,000 number ranges (SNR) and join with inequality to a 72M row table. It’s perfectly conceivable that at some point the data set expands (not necessarily all at once) to literally tens of billions of rows that are then aggregated down to the 37,600 that are finally inserted.

The execution plan shows the optimizer joining the first three tables before doing a merge join between that result set and the relevant subset of the MSISDNs table – which means the MSISDNs have to be sorted and buffered (with a probable spill to disc) before they can be used. It would be interesting to see the rowsource execution stats for the query – partly to see how large the generated set became, but also to see if the ranges involved were so large that most of the time went in constantly re-reading the sorted MSISDNs from the temporary tablespace.
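If the statement can be re-run (or is still in the cursor cache), a sketch of the standard way to capture those rowsource execution stats:

alter session set statistics_level = all;

-- re-run the original insert/select here, then report the stats for the
-- last statement executed by this session (the two null arguments pick
-- up exactly that statement from the cursor cache):
select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));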

As far as optimisation is concerned, there are a couple of trivial things around the edges we can examine: we have 10,000 number ranges but insert 37,600 results, and the last stages of the plan generated those results, so we’ve scanned and aggregated the sorted MSISDNs 37,600 times. Clearly we could look for a better table ordering that eliminated any number ranges early, did the minimal number of joins to MSISDN, aggregated, and then scaled up to 37,600: with the best join order we might reduce the run time by a factor of 3 or more, as sketched below. (But that’s still a couple of hours run time.)
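Purely as a sketch of that idea – untested, and assuming for illustration that a (FROM_NUMBER, TO_NUMBER) pair identifies a range, which would need checking against the real schema – each distinct surviving range is counted against SA_MSISDNS exactly once, and the counts are then joined back out to the full result set:

with surviving_ranges as (
        -- eliminate number ranges as early as possible
        select  distinct
                snr.from_number, snr.to_number
        from    sa_number_ranges   snr,
                sa_service_systems sss
        where   snr.id_type   = 'M'
        and     sss.seq       = snr.srvsys_seq
        and     sss.systyp_id = 'OMC HLR'
),
range_counts as (
        -- one expensive inequality join and aggregation per distinct range
        select  sr.from_number, sr.to_number,
                count(m.msisdn)         msisdn_count
        from    surviving_ranges sr,
                sa_msisdns       m
        where   m.state   = 'AVL'
        and     m.msisdn >= sr.from_number
        and     m.msisdn <= sr.to_number
        group by
                sr.from_number, sr.to_number
)
select
        sne.id                                  as hlr,
        snr.from_number||' - '||snr.to_number   as number_range,
        rc.msisdn_count                         as available_msisdns
from
        sa_number_ranges    snr,
        sa_service_systems  sss,
        sa_network_elements sne,
        range_counts        rc
where
        sss.seq        = snr.srvsys_seq
and     sss.systyp_id  = 'OMC HLR'
and     sne.seq        = sss.ne_seq
and     snr.id_type    = 'M'
and     rc.from_number = snr.from_number
and     rc.to_number   = snr.to_number
;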

What we really need to do to make a difference is change the infrastructure in some way – preferably invisibly to the rest of the application. There are a number of specific details relating to workload, read-consistency, timing, concurrency, etc. that will need to be considered, but broadly speaking, we need to take advantage of a table that effectively holds the “interesting” MSISDNs in sorted order. I’ve kept the approach simple here; it would need a few modifications for a production system. The important bit of the report is the bit that produces the count, so I’m only going to worry about a two-table join – number ranges and msisdns; here’s some model data:


execute dbms_random.seed(0)

create table msisdns
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1e9,1e10))      msisdn
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table number_ranges
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1e9,1e10))      from_number,
        trunc(dbms_random.value(1e9,1e10))      to_number
from
        generator       v1
where
        rownum  <= 1000
;

update number_ranges set
        from_number = to_number,
        to_number = from_number
where
        to_number < from_number
;

commit;

I’ve created a table of numbers with values between 1e9 and 1e10 to represent 1 million MSISDNs, and a list of 1,000 number ranges – making sure that the FROM number is not greater than the TO number. Now I need a “summary” table of the MSISDNs, which I’m going to create as an index-organized table:


create table tmp_msisdns (
        msisdn,
        counter,
        constraint tmp_pk primary key (msisdn, counter)
)
organization index
as
select
        msisdn,
        row_number() over(order by msisdn)      counter
from
        msisdns
;

This is only a demonstration so I haven’t bothered with production-like code to check that the MSISDNs I had generated were unique (they were); and I’ve casually included the row_number() as part of the primary key as a performance fiddle, even though it’s something that could, technically, allow some other program to introduce bad data if I made the table generally available rather than task-specific.

Finally we get down to the report. To find out how many MSISDN values there are between the FROM and TO numbers in a range I just have to find the lowest and highest MSISDNs from tmp_msisdns in that range, take the difference between their counter values, and add 1. And there’s a very fast way to find the lowest or highest value when you have the appropriate index – the min/max range scan – but you have to access the table twice, once for the low, once for the high. Here’s the necessary SQL, with execution plan from 12.1.0.2:


select
        nr.from_number, nr.to_number,
--      fr1.msisdn, fr1.counter,
--      to1.msisdn, to1.counter,
        1 + to1.counter - fr1.counter range_count
from
        number_ranges   nr,
        tmp_msisdns     fr1,
        tmp_msisdns     to1
where
        fr1.msisdn = (
                select min(msisdn) from tmp_msisdns where tmp_msisdns.msisdn >= nr.from_number
        )
and     to1.msisdn = (
                select max(msisdn) from tmp_msisdns where tmp_msisdns.msisdn <= nr.to_number
        )
;

-------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |               |       |       |  4008 (100)|          |
|   1 |  NESTED LOOPS                   |               |  1000 | 38000 |  4008   (1)| 00:00:01 |
|   2 |   NESTED LOOPS                  |               |  1000 | 26000 |  2005   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL            | NUMBER_RANGES |  1000 | 14000 |     2   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN             | TMP_PK        |     1 |    12 |     2   (0)| 00:00:01 |
|   5 |     SORT AGGREGATE              |               |     1 |     7 |            |          |
|   6 |      FIRST ROW                  |               |     1 |     7 |     3   (0)| 00:00:01 |
|*  7 |       INDEX RANGE SCAN (MIN/MAX)| TMP_PK        |     1 |     7 |     3   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN              | TMP_PK        |     1 |    12 |     2   (0)| 00:00:01 |
|   9 |    SORT AGGREGATE               |               |     1 |     7 |            |          |
|  10 |     FIRST ROW                   |               |     1 |     7 |     3   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN (MIN/MAX) | TMP_PK        |     1 |     7 |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("FR1"."MSISDN"=)
   7 - access("TMP_MSISDNS"."MSISDN">=:B1)
   8 - access("TO1"."MSISDN"=)
  11 - access("TMP_MSISDNS"."MSISDN"<=:B1)

Execution time – with 1 million MSISDNs and 1,000 ranges: 0.11 seconds.

For comparative purposes, and to check that the code is producing the right answers, here’s the basic inequality join method:


select
        nr.from_number, nr.to_number, count(*) range_count
from
        number_ranges   nr,
        msisdns         ms
where
        ms.msisdn >= nr.from_number
and     ms.msisdn <= nr.to_number
group by
        nr.from_number, nr.to_number
order by
        nr.from_number
;

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |               |       |       |       |   472K(100)|          |
|   1 |  HASH GROUP BY        |               |   707K|    14M|  6847M|   472K (17)| 00:00:19 |
|   2 |   MERGE JOIN          |               |   255M|  5107M|       | 13492  (77)| 00:00:01 |
|   3 |    SORT JOIN          |               |  1000 | 14000 |       |     3  (34)| 00:00:01 |
|   4 |     TABLE ACCESS FULL | NUMBER_RANGES |  1000 | 14000 |       |     2   (0)| 00:00:01 |
|*  5 |    FILTER             |               |       |       |       |            |          |
|*  6 |     SORT JOIN         |               |  1000K|  6835K|    30M|  3451   (7)| 00:00:01 |
|   7 |      TABLE ACCESS FULL| MSISDNS       |  1000K|  6835K|       |   245  (14)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("MS"."MSISDN"<="NR"."TO_NUMBER")
   6 - access("MS"."MSISDN">="NR"."FROM_NUMBER")
       filter("MS"."MSISDN">="NR"."FROM_NUMBER")

The two queries produced the same results (apart from ordering); but the second query took 2 minutes 19.4 seconds to complete.

 

Update:

In a moment of idle curiosity I recreated the data with 40 million rows in the MSISDNs table to get some idea of how fast the entire report process could go when re-engineered (remember the OP has 72M rows, but selects only the subset flagged as ‘AVL’). It took 1 minute 46 seconds to create the IOT – after which the report for 1,000 number ranges still took less than 0.2 seconds.
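For reference, the refresh step for the summary table is then tiny; a minimal sketch, ignoring the read-consistency and concurrency issues flagged earlier:

truncate table tmp_msisdns;

insert into tmp_msisdns (msisdn, counter)
select
        msisdn,
        row_number() over (order by msisdn)
from
        msisdns
;

commit;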


Avoiding the COMMIT bomb!

I’m an Oracle dinosaur, so I like using SQL Plus.  It’s simple, fast, comes with every version, installation and platform, and I’m very familiar with it.  (And who knows, it might still be front of mind for the Oracle development teams!  http://www.slideshare.net/hillbillyToad/sqlcl-overview-a-new-command-line-interface-for-oracle-database )

But there is one important thing I always take care of when I’m using SQL Plus, and it’s easiest to explain with an example.

You start off by wanting to delete a couple of rows from a critical table as part of a patching process.  You type this:

[image: a DELETE statement issued against a critical table – with the WHERE clause missing]

Uh oh….. I forgot the WHERE clause.  That’s not so good…

Now, if I keep my cool, I can let that command finish and then type rollback, and the damage is limited to potentially blocking other users for a little while.

But of course, what normally happens is that you see your command, your jaw drops to the floor, and you grab your mouse, race up to the top left or right corner, and click to close that window.

[image: closing the SQL Plus window while the DELETE is still running]

 

Phew!  Crisis averted – your statement never finished, so Oracle will happily clean up after you and roll all the deletions back.

But that is a very, very risky thing to do… because what if, JUST WHEN YOU CLICK, the delete completes and the screen looks like this:

[image: the DELETE has just completed, reporting its row count, as the window closes]

Well… by default, when you exit SQL Plus, it will commit any outstanding changes.  And since your delete has completed… you just wiped out your table, and you’re off to LinkedIn looking for a new place to work :-(

There is a better way.  In your login.sql file for SQL Plus, make sure you always have the following:

set exitcommit off

You never, ever want to be committing (with any tool) unless you explicitly request it.
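As a quick sanity check – a sketch; EXITCOMMIT arrived with the 11g SQL Plus, so an older client won’t recognise it – you can confirm the setting from a running session, and the output should look something like this:

SQL> show exitcommit
exitcommit OFF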

Video Tutorial: XPLAN_ASH Active Session History - Part 5

The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality, continuing the actual walk-through of the script output.

More parts to follow.

DBAKevlar at Collaborate 2015

Here’s a quick guide to my schedule of sessions at IOUG Collaborate 2015 for this year.  I’m looking forward to seeing everyone next week (I’ll be arriving on Saturday, so I really should say in a couple of days), and have no doubt we’re up for another phenomenal conference in Las Vegas, this year at the Mandalay!

Additionally, there are some great opportunities in professional development at the IOUG Strategic Leadership Program.  I’ll be speaking with James Lui on how to improve your value with personal brands and social media.

I’m the WIT luncheon speaker on Monday, so for those of you that signed up for this great yearly event for women in tech before it sold out, I’m looking forward to meeting as many of the powerful women in our industry as possible (and hopefully a few men, too!) before and after the event.

There are also a few SIGs that I’ll be attending that aren’t on the schedule:

Tue, 12:30 p.m. – 1:00 p.m. | Cloud Computing and Virtualization SIG Meeting (ID: 943)

Wed, 12:30 p.m. – 1:00 p.m. | Oracle Enterprise Manager SIG Meeting (ID: 949)

Id | Date and Time | Session Title | Product Line(s) | Session Track(s) | Session Room

  • 976 | Sun. Apr. 12, 9:00 am – 1:00 pm | Everything I Needed to Know About Enterprise Manager I Learned at COLLABORATE – Hands-on Lab | Oracle Enterprise Manager | Manageability | Palm D
  • 0 | Mon. Apr. 13, 9:15 – 10:15 am | Zero to Manageability in One Hour: Build a Solid Foundation for Oracle Enterprise Manager 12c | Oracle Cloud – DBaaS/PaaS, Oracle Database, Oracle Enterprise Manager | Manageability | Banyan B
  • 112 | Mon. Apr. 13, 3:15 – 4:15 pm | The Power of the AWR Warehouse | Oracle Enterprise Manager | Manageability | Banyan B
  • 967 | Mon. Apr. 13, 4:30 – 5:30 pm | IOUG Strategic Leadership Program: Staying on Top of Key Trends and Growing Your Personal Brand with Social Media | Applicable to All | Professional Development | Banyan E
  • 1003 | Wed. Apr. 15, 2:00 – 3:00 pm | OakTable: TED-Style Lightning Talks (2:00-3:00pm) | Applicable to All | Database | Mandalay K
  • 986 | Mon. Apr. 13, 12:45 – 1:45 pm | Twitter 101 for Oracle Professionals | Applicable to All | Professional Development | Banyan D

Thanks to everyone at IOUG for the opportunity to be so involved with this great conference – see you all soon!



Copyright © DBA Kevlar [DBAKevlar at Collaborate 2015], All Rights Reserved, 2015.

Data Cloning and Refresh

For some time now, I’ve been creating blog posts that walk you through using some of the Enterprise Manager technology to perform specific tasks. For example, there’s one here on setting up Chargeback in EM 12.1.0.4. I’ve never been really satisfied with the way these blog posts turn out, as documenting a step-by-step process like this takes lots of screenshots and you end up with a post that’s a mile long. It also gives the impression that doing something can be quite complex, when in fact it might only take a few minutes to perform all those steps.

In this blog post, I thought I’d take a different approach. Recently, I recorded a couple of screenwatches for use on OTN. One was on creating a test master database with continuous flow using a Data Guard standby database (available here) and the other on how to use Snap Clone outside the cloud environment (available here).

Both those URLs stream off YouTube. If you want to download them and play them back off your own computer, here are the relevant links:

Continuous Flow – the relevant checksums if you want to make sure it has downloaded properly are:

CRC-32: f4fb51db
MD4: 8c3fbefff8cb8c464bf946739d30121b
MD5: a40a2a8232e53265b04ce789f8fa5738
SHA-1: ceafcb9e2d395a7756f25799fb58e4a7c11dbd10

Snap Clone outside the cloud environment – the relevant checksums if you want to make sure it has downloaded properly are:

CRC-32: 9e0cea20
MD4: dd9d87ac4fc78a65d1ef79cb93c3b690
MD5: 9101f17ed289714aeee6b42974abc085
SHA-1: 63c2e2228b4e33a7a0552484903b51883fc80da5
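If you’re downloading on Linux (or anywhere md5sum is available), verifying is a one-liner; a sketch, with a hypothetical filename standing in for whatever name you saved the Snap Clone video under:

$ md5sum snap_clone_outside_cloud.mp4      # hypothetical filename
9101f17ed289714aeee6b42974abc085  snap_clone_outside_cloud.mp4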

The screenwatches both start with some explanation of what you can achieve using EM to clone data in the way referred to in the video, and then walk you through the process of actually doing it. Of course, the only problem with loading videos (apart from the size) instead of screenshots is that you will need to pause them manually if you want to stop and read the details of the screen. In any case, try them out and let me know (post a comment below if you like) which way you prefer – long step-by-step screenshot posts, or these shorter video versions.

Tuning Database XQuery Statements (2)

So we concluded the post “Tuning Database XQuery Statements (1)” with the following SQL statement…

Oaktable World Las Vegas April 15, 2015 at Collaborate


Wednesday, April 15, at Collaborate 2015 Las Vegas, room Mandalay K.
For more information see Pythian’s blog post.


What is Oaktable World? Oaktable World is a day organized by members of the Oaktable network. The Oaktable network is a network of Oracle database tuning geeks; among the members are Tanel Poder, Jonathan Lewis, Cary Millsap, etc.  The Oaktable network was created by Mogens Nørgaard back in 2001 or so, and Mogens started organizing underground get-togethers, known as Oracle Closed World, during Oracle Open World in 2007, usually at Mogens’ “office” at Chevys. In 2010 Oracle Closed World became a bit more organized, with T-shirts as shown below. In 2012 Oracle kindly informed us we were not allowed to use “Oracle” in the group’s meeting name, so it was changed to Oaktable World. Since 2012, Oaktable Worlds have been hosted at Oracle Open World, UKOUG, Collaborate and DOAG.

[image: Oracle Closed World 2010 T-shirt]

[image: Oaktable World sponsors]

Hey dude, where’s my memory? part 2

In my previous article I started exploring the memory usage of a process on a recent Linux kernel (2.6.39-400.243.1 (UEK2)). “Recent” means recent for the Enterprise Linux distributions in this context; Linux kernel developers would point out that the kernel itself is at version 3.19 (the “stable” version at the time of writing of this blog post).

The previous article showed that every process has its own address space, and that different allocations exist for execution. These allocations can be seen in the proc pseudo filesystem, in a process-specific file called ‘maps’. The process itself needs some administration areas, which are anonymous allocations marked with [heap] and [stack], and a few others called [vdso] and [vsyscall]. Every process executes something (not all people realise that: a process can wait for something, but essentially it is always executing). So there will always be an executable in a (regular) process’ address space. In a lot of cases, the executable uses shared libraries. In that case, the libraries are loaded into the address space of the process too, alongside the executable.

The executable contains (at least) two sections: the code segment, which is read-only and potentially shared, and the data segment, which is read-write and gets truly allocated (instead of shared with the parent) if the process needs to write to it. In a lot of cases, the executable uses shared libraries, which means it uses functions which are stored in those libraries. A library also needs to be loaded, and also contains multiple sections, which can be read-only or read-write, and which are shared unless the process needs to write to a segment.

For completeness, here’s the complete maps output of a process executing the ‘cat’ executable again:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

If you look closely, you will see that I didn’t explain one type of allocation yet: the anonymous allocations. The anonymous allocations are visible in lines 4, 11, 13 and 16. The anonymous mapping directly following the data segment of either an executable or a library mapped into a process’ address space is called the BSS. The data segment and the BSS both store static variables; however, the data segment stores initialised variables, while the BSS stores uninitialised variables. The anonymous mapping for the BSS section might exist for a library, as seen above, or might not; not all Oracle database libraries use an anonymous memory mapping, for example. Actually, there is one other memory allocation visible, /usr/lib/locale/locale-archive, which is a file for the locale (multi-language support) functions in the C library, and which is out of scope for this article.

When a process requests memory to store something, the C library function malloc() (memory allocation) can be called. This function inspects the size of the allocation, and will either allocate memory from the process’ heap (the memory mapping marked with [heap], grown using the system call brk()) or allocate space in a new anonymous memory segment, using the system call mmap(). If you follow the malloc() link, you can read the source code of the malloc() call. There are different malloc() implementations, which fulfil different purposes (embedded devices have different requirements than huge servers); the implementation that Enterprise Linuxes use is one called ptmalloc2, which is based on a version written by Doug Lea. If you read the comments in the source code, specifically at ‘Why use this malloc?’, you will see that it tries to be smart with requests up to 128KB (for memory re-usability, to avoid fragmentation and memory wastage), which are allocated from the heap. If an allocation is larger than 128KB, it will use the system memory mapping facilities.
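One way to see the two paths for yourself is to trace the allocation-related system calls of a process; a sketch, with a placeholder program name:

$ # brk() calls show the heap growing for the smaller malloc() requests;
$ # anonymous mmap() calls appear for requests over the ~128KB threshold.
$ strace -e trace=brk,mmap some_program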

Okay, this brings us back to the original question: how much memory does this process take? I hope you recall from the first blog post that Linux tries to share as much memory as possible, and that when a new process is created, the allocations in the address space of this new process are pointers to the memory areas of the parent process. Let’s first use a utility a lot of people use: top.

top - 10:53:40 up 9 days, 14:11,  2 users,  load average: 1.34, 1.36, 1.37
Tasks: 1124 total,   1 running, 1123 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  0.8%sy,  0.1%ni, 98.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98807316k total, 97411444k used,  1395872k free,   400288k buffers
Swap: 25165820k total,  3560852k used, 21604968k free, 27573200k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 16391 oracle    -2   0 2369m 4960 4812 S  0.7  0.0  91:41.37 asm_vktm_+asm1
 16407 oracle    20   0 2388m  20m 8580 S  0.7  0.0  46:16.40 asm_dia0_+asm1

I edited the output a bit to show only two processes from an ASM instance.
The columns that show information on memory are VIRT, RES, SHR and %MEM.

VIRT is described in the man page of top as ‘The total amount of virtual memory used by the task’. This means it’s ALL the memory visible in the address space of the process. A useful utility to get the contents of the virtual memory allocations for a process is pmap; let’s use it for process 16391, which is asm_vktm_+asm1:

$ pmap -x 16391
16391:   asm_vktm_+ASM1
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000  246644    2520       0 r-x--  oracle
000000000f6dd000    1952      44      20 rw---  oracle
000000000f8c5000     140       0       0 rw---    [ anon ]
000000001042c000     356       0       0 rw---    [ anon ]
0000000060000000    4096       0       0 rw-s-  SYSV00000000 (deleted)
0000000061000000 2080768       0       0 rw-s-  SYSV00000000 (deleted)
00000000e0000000      12       4       4 rw-s-    [ shmid=0x4e78003 ]
00007fa275589000      72      12       0 r-x--  libnfsodm12.so
00007fa27559b000    2044       0       0 -----  libnfsodm12.so
00007fa27579a000       8       0       0 rw---  libnfsodm12.so
00007fa27579c000    1604       0       0 r-x--  libshpksse4212.so
00007fa27592d000    2044       0       0 -----  libshpksse4212.so
00007fa275b2c000      72       0       0 rw---  libshpksse4212.so
00007fa275b3e000      20       4       0 r-x--  libcxgb3-rdmav2.so
00007fa275b43000    2044       0       0 -----  libcxgb3-rdmav2.so
00007fa275d42000       4       0       0 rw---  libcxgb3-rdmav2.so
...snip...
00007fa27abd5000       8       0       0 r-x--  libodmd12.so
00007fa27abd7000    2044       0       0 -----  libodmd12.so
00007fa27add6000       4       0       0 rw---  libodmd12.so
00007fa27add7000     128     116       0 r-x--  ld-2.12.so
00007fa27adf8000     512      12      12 rw---    [ anon ]
00007fa27ae78000     212       4       4 r--s-  passwd
00007fa27aead000    1260      68      68 rw---    [ anon ]
00007fa27aff3000       4       4       0 rw-s-  hc_+ASM1.dat
00007fa27aff4000       8       0       0 rw---    [ anon ]
00007fa27aff6000       4       0       0 r----  ld-2.12.so
00007fa27aff7000       4       0       0 rw---  ld-2.12.so
00007fa27aff8000       4       0       0 rw---    [ anon ]
00007fff7b116000     132       8       8 rw---    [ stack ]
00007fff7b1ff000       4       4       0 r-x--    [ anon ]
ffffffffff600000       4       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB         2426668    5056     156

The column ‘Kbytes’ represents the full size of the executable, libraries, shared memory, anonymous mappings and other mappings of this process. For completeness’ sake: 2426668/1024=2369.79, which matches the 2369 in the top output. This is all the memory this process can see, and could use. Does this tell us anything about the memory process 16391 actually takes? No. (Parts of) the Oracle executable’s allocations are potentially shared, the shared memory (SYSV00000000 (deleted) and [ shmid=0x4e78003 ], which represent the Oracle SGA) is shared, and the memory allocations for the libraries are potentially shared. The anonymous memory mappings have been defined, but the actual allocation is not visible in the Kbytes column. What this column in top does for me is tell me the approximate SGA size, especially if the SGA is larger (meaning multiple gigabytes).

The second memory column in top is RES. RES is described as: ‘The non-swapped physical memory a task is using’. RES is sometimes referred to as RSS, and called ‘resident set size’. As we can see from the total, the RSS is far smaller than the virtual memory size. One important thing in the RES description of top is that swapped-out memory pages are not counted towards the RES/RSS value. RES/RSS corresponds to the memory actually used (“touched”) by a process, and directly usable. If you look back at the RSS column of pmap above, you see the oracle executable’s two mappings: one has an RSS size of 2520, and one has an RSS size of 44. But… if you remember that the code/read-only segment is potentially shared with other processes, and then look at the 2520 value (which is of the oracle memory segment with the rights r-x--, which means the code segment), I hope you understand this just means this process (vktm) read a subset of the entire executable, and more importantly: the RSS size does not reflect physical memory uniquely allocated by this process.

If we look at the shared memory segments, it’s interesting to see what happens during the normal life of a database session. It is probably needless to say that you should account for shared memory separately from per-process memory usage, since it’s a distinct memory set that is truly shared by all the processes that are created for the instance.

This is a session which has just started:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
...
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     348     348 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848    2316    2316 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592       0       0 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]
...

This instance has its SGA set to 4G. Because the session has just started, it has only touched 2316 (KB) of the SGA. Next, I do a big (buffered!) full table scan, requiring the session to put a lot of blocks into the buffer cache. After the scan, look at the shared memory segments using pmap again:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
...
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     384     384 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848 2279040 2279040 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592   66564   66564 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]
...

The session has touched half of the SGA shared memory segment (visible in the RSS column of the 6th line). This is logical if you understand what is going on: the process does a buffered table scan, which means the blocks read from disk need to be stored in the buffer cache, which is one of the memory structures in the Linux shared memory segments. However, if you look strictly at the top utility output of a database that has just started up, you see the RSS size of all the processes that work with the buffer cache growing. This phenomenon has led to a repetitive question on the Oracle Linux forums asking whether Oracle database processes are leaking memory. Of course the correct answer is that the RSS size grows because the process simply touches more of the shared memory (=SGA) that has been mapped into its address space. It will stop increasing once it has touched all the memory it can touch.
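You can watch this happening with nothing more than ps; a sketch, using the PID of the session above:

$ # print the RSS (in KB) of the server process every 5 seconds; it grows
$ # as more of the SGA is touched, then levels off.
$ while true; do ps -o rss= -p 43853; sleep 5; done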

%MEM is the RES/RSS size expressed as a percentage of the total physical memory.

SHR is the amount of shared memory. The manpage of top says ‘It simply reflects memory that could be potentially shared with other processes’. Do not confuse this with the shared memory segments mapped into the process’ address space. Empirical tests show the SHR value always seems to be lower than the RSS size, which suggests it tracks the RSS value and shows the touched memory that could be shared (which seems to contain both touched memory from the shared memory segments and the privately mapped executable and libraries). At least from an Oracle performance and troubleshooting perspective, I can’t see any benefit in using this value.

The conclusion of this part on memory usage is that neither the virtual set size (VIRT/VSZ) nor the resident set size (RES/RSS) is a figure you can add up to indicate the physical memory usage of a single process or a group of processes.

The virtual set size gives you the total amount of virtual memory available to a single process, which includes the executable, and potentially shared libraries, anonymous memory mappings and files mapped into the address space and shared memory. In a lot of cases, you get an indication of the total SGA size of the instance, because the shared memory segments which contain the SGA are entirely mapped into the address space of the process.

The resident set size shows you how much of the memory and files mapped into the address space has actually been “touched”, and is directly usable. Some of the memory usage results in memory pages private to the process, because of writing to the memory and the copy-on-write mechanism of the Linux memory manager. A lot of other memory mappings can be used, increasing the resident set size, while they are shared with other processes. A third potential component is the actual usage of memory as a result of anonymous memory mappings (versus the total allocation, which can be much more), which are private to the process.
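One per-process figure that can meaningfully be summed is Pss (“proportional set size”) in /proc/&lt;pid&gt;/smaps, which apportions each shared page across the processes sharing it; a minimal sketch, reusing the PID from earlier:

$ # sum the proportional set size over all mappings of process 16391
$ awk '/^Pss:/ { total += $2 } END { print total " kB" }' /proc/16391/smaps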

Tuning Database XQuery Statements (1)

I had a question of a colleague to have a look at a statement that…