It’s really hard for me to make a judgement about Chapterhouse: Dune. On the one hand there are some excellent characters and the general story line is great. On the other, there are parts I found really boring. I got a bit sick of the teasers without any explanation. At first it was intriguing, but as they continued I just got a bit fed up with them and decided to stop second-guessing the outcome and just let it happen. I think there are two ways an author can play this game:
1) Make the outcome fairly obvious from the start, but make the journey to get there exciting. Kind of like The Dresden Files.
2) Make the outcome a mystery, but subtly lead you in the right direction.
I think this book is trying to do the latter, but is quite clumsy about it. Having said all that, I’m glad I read it. The overall outcome is more than satisfactory.
I’m not going to read the books by Frank Herbert’s son. I’ve been told they are not good, and the brief snippets I’ve read seem to reinforce that.
I guess the end of a series of books like this needs a bit of a summary. I think the first book is a total classic. The rest you can take or leave. There are definitely interesting elements to all of them, but they are not nearly as accomplished as the first.
How do you want to start the day? I’m guessing it’s not to be called out to the front of the room by a speaker and used as a guinea pig, while they ask you trick questions to make you look stupid. Tom Kyte, you will pay. Oh yes! You will pay!!!
The sessions I attended on day 2 were:
Another very useful day indeed. I had some good feedback and interesting questions about my talks. This sort of feedback is really important when you are presenting regularly as it allows you to continuously refine your material and presenting skills. It can sometimes give you a fresh perspective on a subject, that inspires you to alter the focus of your presentations entirely. I’m very grateful to anyone who takes the time to provide this sort of feedback. Big thanks to Tom Kyte, who has given me some very useful advice over the last couple of days, but then he owes me for making me look stupid in his first session of the day!
In the evening we went out for dinner at a restaurant just down the road from the hotel. I ate plenty of cheese, so I was in heaven. Not surprisingly, much of the talk ended up being about Oracle. It may seem a little sad to some people, but when I’m surrounded by people with brains the size of a planet, I can’t help myself quizzing them about this stuff. I love it!
Great big thanks go out to Milena and her gang for organizing this event and inviting me. Thanks also to Stoyan for being my driver again. No offence to other user groups, but BGOUG conferences are my favorite events of the year. I will keep coming back as long as you will have me! Also, a big thank you to the Oracle ACE program for making this possible.
So Day 1 (part 2) didn’t go to plan because I forgot to take my camera or my phone to the party.
Suffice to say, lots of food, lots of drink (for those that do) and most importantly lots of dancing. Yes, I once again murdered the traditional dances of Bulgaria, but it’s the taking part that counts, right?
I had good intentions of leaving early, but I ended up chatting about Oracle until about 02:00. Day 2 is going to be tough…
Last night we all got together to eat some food and chat. Julian Dontcheff is practically a savant where Bulgarian Poetry, World Cup match results and random Oracle facts are concerned. Although Christian Antognini was pretty impressive on the random Oracle facts too.
I didn’t have any presentations today, so I got to sit and watch. I’ve done loads of typing, mostly of syntax for 12c features, but it’s not really stuff that is worth posting, because I have no way to validate it, so I’m just going to keep it as a reminder for when I get hold of 12c and can try it out.
The sessions I went to included:
There was a lot of material I had seen at OOW2012 and UKOUG2012, but also a lot I had not, so I’m glad I went to them. The smaller setting also made it easier to ask questions, which can be quite daunting at the big events.
Tom gave me a couple of tips that have gone straight into one of my talks for tomorrow. I’m gonna have to name check him for it, or I’ll feel like I’m passing it off as my own.
I said this after OOW2012 and I’m sure I will say it again, but the number of changes in 12c is pretty daunting. I guess the fact it’s been about a 3-year wait, rather than the normal 18 months, adds to that. In many cases (but not all) it’s not the scope of the individual changes that is the issue, but the sheer volume of them. I think people are going to be blogging for a long time before they’ve got through them. It will be interesting to see what gets selected for inclusion in the OCP DBA upgrade exam.
I’m off to dinner now. I will try to get some photos and post them in “Day 1 (part 2)” tomorrow.
The call for papers is open for Tech 13 – the “server-side” conference of the UKOUG.
The conference was getting so big that we’ve split Apps from Server Tech and will be running the two conferences separately this year. The Server Tech conference will be in Manchester from 2nd to 4th Dec.
The closing date for submissions is Friday 31st May (only 2 weeks!) and confirmation of acceptance will be given by August.
There is a slightly shorter route to submission (if you don’t want to watch the video on “Why to speak”).
This post is a brief discussion of the advantages of activating parallelism by altering the session environment instead of using the alternative ways (hints, DDL). The latter are the most popular in my experience, but I have noticed that their popularity is frequently due more to imperfect understanding than to informed decision - and that's a pity, since "alter session force parallel query" can really save everyone a lot of tedious work and improve maintainability a great deal.
We will also check that issuing
alter session force parallel query parallel N;
is the same as specifying the hints
/*+ parallel (t,N) */ /*+ parallel_index (t, t_idx, N) */
for all tables referenced in the query, and for all indexes defined on them (the former is quite obvious, the latter not that much).
Side note: it is worth remembering that hinting the table for parallelism does not cascade automatically to its indexes as well - you must explicitly specify the indexes that you want to be accessed in parallel by using the separate parallel_index hint (maybe specifying "all indexes" by using the two-parameter variant "parallel_index(t,N)"). The same holds for "alter table parallel N" and "alter index parallel N", of course.
the power of "force parallel query"
I've rarely found any reason for avoiding index parallel operations nowadays - usually both the tables and their indexes are stored on disks with the same performance figures (if not the same set of disks altogether), and the cost of the initial segment checkpoint is not generally different. On the contrary, using an index can offer terrific opportunities for speeding up queries, especially when a full table scan can be substituted by a fast full scan on a (perhaps much) smaller index.
Thus, I almost always let the CBO consider index parallelism as well. Three methods can be used:
- statement hints (the most popular option)
- alter table/index parallel N
- "force parallel query".
I rather hate injecting parallel hints everywhere in my statements since it is very risky. It is far too easy to forget to specify a table or index (or simply to misspell it), not to mention forgetting new, potentially good indexes added after the statement was finalized. Also, you must edit the statement even if you simply want to change the degree of parallelism, perhaps just because you are moving from an underequipped, humble and cheap test environment to a mighty production server. On the contrary, "force parallel query" is simple and elegant - just a quick command and you're done, with a single place to touch in order to change the parallel degree.
"alter table/index parallel N" is another weak technique in my opinion, mainly for two reasons. The first is that it is a permanent modification to the database objects, and after the query has finished, it is far too easy to fail to revert the objects back to their original degree setting (because of a failure or a coding bug). The second is the risk of two concurrent sessions colliding on the same object that they both want to read, but with different degrees of parallelism.
Both problems above disappear only when you always want to run with a fixed degree for all statements; but even in this case, I would consider issuing "force parallel query" (maybe inside a logon trigger) instead of having to set/change the degree for all tables/indexes accessed by the application.
I have noticed that many people are afraid of "force parallel query" because of the word "force", believing that it switches every statement into parallel mode. But this is not the case: as Tanel Poder recently illustrated, the phrase "force parallel query" is misleading; a better one would be something like "consider parallel query", since it is perfectly equivalent to hinting the statement for parallelism as far as I can tell (see below). And hinting itself tells the CBO to consider parallelism in addition to serial execution; the CBO is perfectly free to choose a serial execution plan if it estimates that it will cost less - as demonstrated by Jonathan Lewis years ago.
Hence there's no reason to be afraid, for example, that a nice Index Range Scan that selects just one row might turn into a massively inefficient Full Table Scan (or Index Fast Full Scan) of a one-million-row table/index. That is true bugs and CBO limitations aside, obviously; but in these hopefully rare circumstances, one can always use the no_parallel and no_parallel_index hints to fix the issue.
"force parallel query" and hinting: test case
Let's show that altering the session is equivalent to hinting. I will illustrate the simplest case only - a single-table statement that can be resolved either by a full table scan or an index fast full scan (check script force_parallel_main.sql in the test case), but in the test case zip two other scenarios (a join and a subquery) are tested as well. Note: I have only checked a couple of recent releases (but I would be surprised if the test case could not reproduce in 10g as well).
Table "t" has an index t_idx on column x, and hence the statement
select sum(x) from t;
can be calculated by either scanning the table or the index. In serial, the CBO chooses to scan the smaller index (costs are from my test environment):
select /* serial */ sum(x) from t;

--------------------------------------
|Id|Operation             |Name |Cost|
--------------------------------------
| 0|SELECT STATEMENT      |     | 502|
| 1| SORT AGGREGATE       |     |    |
| 2|  INDEX FAST FULL SCAN|T_IDX| 502|
--------------------------------------
If we now activate parallelism for the table, but not for the index, the CBO chooses to scan the table:
select /*+ parallel(t,20) */ sum(x) from t

------------------------------------------
|Id|Operation              |Name    |Cost|
------------------------------------------
| 0|SELECT STATEMENT       |        | 229|
| 1| SORT AGGREGATE        |        |    |
| 2|  PX COORDINATOR       |        |    |
| 3|   PX SEND QC (RANDOM) |:TQ10000|    |
| 4|    SORT AGGREGATE     |        |    |
| 5|     PX BLOCK ITERATOR |        | 229|
| 6|      TABLE ACCESS FULL|T       | 229|
------------------------------------------
since the cost for the parallel table access is now down from the serial cost of 4135 (check the test case logs) to the parallel cost 4135 / (0.9 * 20) = 229, thus less than the cost (502) of the serial index access.
Hinting the index as well makes the CBO apply the same scaling factor (0.9*20) to the index as well, and hence we are back to index access:
select /*+ parallel_index(t, t_idx, 20) parallel(t,20) */ sum(x) from t

---------------------------------------------
|Id|Operation                 |Name    |Cost|
---------------------------------------------
| 0|SELECT STATEMENT          |        |  28|
| 1| SORT AGGREGATE           |        |    |
| 2|  PX COORDINATOR          |        |    |
| 3|   PX SEND QC (RANDOM)    |:TQ10000|    |
| 4|    SORT AGGREGATE        |        |    |
| 5|     PX BLOCK ITERATOR    |        |  28|
| 6|      INDEX FAST FULL SCAN|T_IDX   |  28|
---------------------------------------------
Note that the cost computation is 28 = 502 / (0.9 * 20), less than the previous one (229).
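The cost scaling can be checked with a quick back-of-the-envelope calculation (a sketch in Python; the serial costs 4135 and 502 are the ones shown in the plans and logs above, and the CBO may round the displayed parallel cost slightly differently):

```python
# Approximate parallel cost model: parallel_cost = serial_cost / (0.9 * DOP)
def parallel_cost(serial_cost, dop, scaling=0.9):
    """Scale a serial cost down by the CBO's parallel scaling factor."""
    return serial_cost / (scaling * dop)

dop = 20
table_cost = parallel_cost(4135, dop)  # serial full table scan cost
index_cost = parallel_cost(502, dop)   # serial index fast full scan cost

print(table_cost)  # close to the 229 shown in the plan
print(index_cost)  # close to the 28 shown in the plan
```

With both objects scaled by the same factor, the index access stays cheaper than the table scan, which is why the CBO goes back to the index once the parallel_index hint is added.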
"Forcing" parallel query:
alter session force parallel query parallel 20;

select /* force parallel query */ sum(x) from t

---------------------------------------------
|Id|Operation                 |Name    |Cost|
---------------------------------------------
| 0|SELECT STATEMENT          |        |  28|
| 1| SORT AGGREGATE           |        |    |
| 2|  PX COORDINATOR          |        |    |
| 3|   PX SEND QC (RANDOM)    |:TQ10000|    |
| 4|    SORT AGGREGATE        |        |    |
| 5|     PX BLOCK ITERATOR    |        |  28|
| 6|      INDEX FAST FULL SCAN|T_IDX   |  28|
---------------------------------------------
Note that the plan is the same (including costs), as predicted.
Side note: let's verify, just for fun, that the statement can run serially even if the session is "forced" as parallel (note that I have changed the statement since the original always benefits from parallelism):
alter session force parallel query parallel 20;

select /* force parallel query (with no parallel execution) */ sum(x) from t WHERE X < 0

----------------------------------
|Id|Operation         |Name |Cost|
----------------------------------
| 0|SELECT STATEMENT  |     |   3|
| 1| SORT AGGREGATE   |     |    |
| 2|  INDEX RANGE SCAN|T_IDX|   3|
----------------------------------
Side note 2: activation of parallelism for all referenced objects can be obtained, in 11.2 and later, using the new statement-level parallel hint (check this note by Randolf Geist for details):
select /*+ parallel(20) */ sum(x) from t

---------------------------------------------------
|Id|Operation                 |Name    |Table|Cost|
---------------------------------------------------
| 0|SELECT STATEMENT          |        |     |  28|
| 1| SORT AGGREGATE           |        |     |    |
| 2|  PX COORDINATOR          |        |     |    |
| 3|   PX SEND QC (RANDOM)    |:TQ10000|     |    |
| 4|    SORT AGGREGATE        |        |     |    |
| 5|     PX BLOCK ITERATOR    |        |     |  28|
| 6|      INDEX FAST FULL SCAN|T_IDX   |T    |  28|
---------------------------------------------------
This greatly simplifies hinting, but of course you must still edit the statement if you need to change the parallel degree.
In my last post about large pages I promised a little more background information on how large pages and NUMA are related.
Background and some history about processor architecture
For quite some time now the CPUs you get from both AMD and Intel are NUMA, or better: cache coherent NUMA CPUs. They all have their own “local” memory directly attached to them; in other words the memory distribution is not uniform across all CPUs. This isn’t really new - Sequent pioneered this concept on x86 a long time ago, but that’s in a different context. You really should read Scaling Oracle 8i by James Morle, which has a lot of excellent content related to NUMA in it, with contributions from Kevin Closson. It doesn’t matter that it reads “8i”; most of it is as relevant today as it was then.
So what is the big deal about NUMA architecture anyway? To explain NUMA and why it is important to all of us, a little more background information is in order.
Some time ago processor designers and architects of industry standard hardware could no longer ignore the fact that the front side bus (FSB) proved to be a bottleneck. There were two reasons for this: it was a) too slow and b) too much data had to go over it. As one direct consequence DRAM memory has been directly attached to the CPUs. AMD did this first with its Opteron processors in the AMD64 micro architecture, followed by Intel’s Nehalem micro architecture. By removing the requirement for data retrieved from DRAM to travel across a slow bus, latency could be reduced.
Now imagine that every processor has a number of memory channels to which DDR3 (DDR4 could arrive soon!) SDRAM is attached. In a dual socket system, each socket is responsible for half the memory of the system. To allow the other socket to access the corresponding other half of memory some kind of interconnect between processors is needed. Intel has opted for the Quick Path Interconnect, AMD (and IBM for p-Series) use Hyper Transport. This is (comparatively) simple when you have few sockets: with up to 4, each socket can directly connect to every other without any tricks. For 8 sockets it becomes more difficult. If every socket can directly communicate with its peers the system is said to be glue-less, which is beneficial. The last glue-less production system Intel released was based on the Westmere architecture. Sandy Bridge (current until approximately Q3/2013) didn’t have an eight-way glue-less variant, and this is exactly why you get Westmere-EX in the X3-8, and not Sandy Bridge as in the X3-2.
Anyway, your system will have local and remote memory. For most of us, we are not going to notice this at all since there is little point in enabling NUMA on systems with two sockets. Oracle still recommends that you only enable NUMA on 8 way systems, and this is probably the reason the oracle-validated and preinstall RPMs add “numa=off” to the kernel command line in your GRUB boot loader.
Booting with NUMA enabled
The easiest way to boot with NUMA enabled is to get to your ILOM and boot the server. As soon as the GRUB line (“booting … in x seconds”) appears, hit a key. You will be dropped into the GRUB menu. It should highlight the default boot entry (Oracle Linux Server (2.6.39-400…x86-64)). Hit the “e” key to edit the directives. You should see something like this now:
root (hd0,0)
kernel /vmlinuz-2.6.39-400.xxx ....
initrd /initramfs-2.6.39-400.xxx
Move the cursor to the line starting with kernel, then hit “e” again. The cursor will move to the end of the line, where you will find the numa=off directive. Hit the backspace key to remove numa=off, then hit return (it will bring you back to the previous three directives), then “b” to boot this configuration.
This is useful because it doesn’t involve editing the grub menu file, and if something should break you can simply restart and are back in a known good configuration.
Now when you log in as root you will notice that NUMA is turned on!
Signs of NUMA
My lab server is an AMD 6238 dual socket workstation with 32GB of RAM. To see the effect of NUMA, you can make use of the numactl tool:
[root@ol62 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 1637 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
node 1 free: 1732 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8192 MB
node 2 free: 1800 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8176 MB
node 3 free: 1745 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
You need to know that Opterons report twice the number of NUMA nodes as there are sockets since the 6100 series. These processors are multi-module chips in the same package. Each of the sockets has 12 cores or better: modules. AMD’s processors are somewhere between HyperThreads and cores; to which extent I can’t tell. The server reports 24 CPUs in any case.
My configuration has allocated 12295 large pages at boot time or roughly 24 GB out of 32GB available. You can see how many pages have been allocated per CPU node in the first half of the output. Luckily the memory has been requested evenly across all NUMA nodes.
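The “roughly 24 GB” figure is easy to verify (a quick sketch; assumes the default 2 MB large page size on x86-64 Linux):

```python
HUGE_PAGE_MB = 2    # default large page size on x86-64 Linux
pages = 12295       # large pages allocated at boot on my system

total_mb = pages * HUGE_PAGE_MB
total_gb = total_mb / 1024
print(f"{total_gb:.1f} GB")  # just over 24 GB out of the 32 GB installed
```

With four NUMA nodes, an even distribution works out to about 3074 pages per node, which matches the per-node figures shown further down.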
The second part of the numactl output gives you the node distances in a matrix. The numbers are provided to the Operating System at boot time in the form of the ACPI System Locality Information Table (SLIT) and cannot be changed. They indicate the relative cost of accessing remote memory. 10 is the base value for local access; higher values indicate more overhead.
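The matrix can be read programmatically as well; a small sketch (the 10/16 values are copied from the numactl output above, and the ratio is only a relative indication, not a measured latency):

```python
# Node distance matrix as reported by `numactl --hardware` above.
distances = [
    [10, 16, 16, 16],
    [16, 10, 16, 16],
    [16, 16, 10, 16],
    [16, 16, 16, 10],
]

local = distances[0][0]      # cost of node 0 accessing its own memory
remote = distances[0][1]     # cost of node 0 accessing node 1's memory
penalty = remote / local     # relative cost of remote vs local access
print(penalty)               # 1.6: remote access is ~60% more expensive
```

Note that the matrix is symmetric with 10 on the diagonal: every node is equally "close" to itself and equally far from every other node on this box.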
NUMA in SYSFS
The sysfs pseudo file system is set to replace the venerable /proc file system. sysfs exports more information than /proc does, which is apparent when it comes to memory allocation per NUMA node. Per-node NUMA statistics are in /sys/devices/system/node*
Two files are of interest: numastat and meminfo. I won’t go into detail about numastat (yet another post will follow), but meminfo is interesting.
[root@ol62 node0]# cat meminfo
Node 0 MemTotal:       8386572 kB
Node 0 MemFree:        1685988 kB
Node 0 MemUsed:        6700584 kB
Node 0 Active:           10516 kB
Node 0 Inactive:         12704 kB
Node 0 Active(anon):      2656 kB
Node 0 Inactive(anon):       0 kB
Node 0 Active(file):      7860 kB
Node 0 Inactive(file):   12704 kB
Node 0 Unevictable:       1172 kB
Node 0 Mlocked:           1172 kB
Node 0 Dirty:                0 kB
Node 0 Writeback:            0 kB
Node 0 FilePages:        21276 kB
Node 0 Mapped:            2960 kB
Node 0 AnonPages:         3156 kB
Node 0 Shmem:              116 kB
Node 0 KernelStack:       1384 kB
Node 0 PageTables:         528 kB
Node 0 NFS_Unstable:         0 kB
Node 0 Bounce:               0 kB
Node 0 WritebackTmp:         0 kB
Node 0 Slab:             23788 kB
Node 0 SReclaimable:      5652 kB
Node 0 SUnreclaim:       18136 kB
Node 0 AnonHugePages:        0 kB
Node 0 HugePages_Total:   3074
Node 0 HugePages_Free:    3074
Node 0 HugePages_Surp:       0
This file is similar to /proc/meminfo but only relevant for node0, i.e. the first 6 “cores” on my system. Here you can see the large page allocation on this node.
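Per-node large page figures can be pulled out of these files with a few lines (a sketch; it parses the meminfo format shown above, fed here with the node0 values from my system):

```python
# Parse "Node 0 HugePages_Total: 3074"-style lines from a per-node meminfo.
sample = """\
Node 0 HugePages_Total:  3074
Node 0 HugePages_Free:   3074
Node 0 HugePages_Surp:      0
"""

stats = {}
for line in sample.splitlines():
    parts = line.replace(":", "").split()
    # parts looks like ["Node", "0", "HugePages_Total", "3074"]
    stats[parts[2]] = int(parts[3])

HUGE_PAGE_MB = 2  # default large page size on x86-64 Linux
print(stats["HugePages_Total"] * HUGE_PAGE_MB)  # 6148 MB allocated on node 0
```

On a real system you would read /sys/devices/system/node/node0/meminfo instead of the inline sample, and loop over all node* directories to check that the allocation really is balanced.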
Why does this matter
When you are consolidating lots of environments onto a system with lots of sockets, you should try and stick to memory locality. Keep instances on a socket if possible; today’s servers can take a lot of memory and you shouldn’t have to use remote memory, thus avoiding the extra latency. I personally would use control groups to ensure my instances stay where I want them to stay. There are other ways to control memory distribution (see some of the SLOB examples) but cgroups are by far the most elegant.
Using NUMA on your system and leaving it to chance how memory is distributed will lead to difficult-to-predict performance. You might even run out of memory on a local node causing unexpected problems. As with everything, understanding and tuning a configuration is the way to go! I will run a few benchmarks next to demonstrate the difference between local and remote memory access. Unfortunately I don’t have a 4-way system available for these tests; normally you wouldn’t really worry about NUMA settings on fewer than four sockets.
Don’t go and rush your systems to NUMA! Like I said, there is little to be gained on the roughly 80% of all servers out there that are dual-socket systems. Four-way servers might be candidates for NUMA, and eight-way servers certainly are. By candidates I mean: if you understand NUMA and how it can affect your application, have really load tested it, and only if it proved to deliver predictable, stable performance, then I would think of enabling NUMA for a production workload. There is nothing like thorough testing that can tell you how your application will perform. I guess all I want to say is that turning on NUMA can have a negative performance impact as well, or even crash your Oracle instance if the memory on a NUMA node is depleted. Search MOS for NUMA to get more information.
Thanks to everyone that submitted abstracts for our upcoming E4 conference. Unfortunately, there were more quality submissions than we had room for. Maybe next year we should expand the event to 3 days. :) But in the meantime, we have assembled what I believe is an excellent line up of speakers. I’ll just mention a few highlights here:
Tom Kyte will be doing the keynote. Enough said!
Maria Colgan and Roger MacNicol will be doing a 3 hour combined session on smart scans. Maria will attack the topic from the top down (optimizer) point of view (since she is the product manager for the optimizer) and Roger will be attacking it from the bottom up (since he is the lead developer for the smart scan code). This should be an awesome session and Tanel Poder has already said he was going to line up the night before.
Ferhat Sgonul will be talking about Turkcell’s usage of Exadata. Turkcell is one of the earliest adopters of Exadata and has had great success with it over the last several years, so this should be a very interesting case study.
Karl Arao and Tyler Muth will do a joint presentation on visualization techniques for performance data from Exadata environments. The plan is for them to compare and contrast their approaches using the same data set. Tyler usually uses R and Karl likes Tableau - may the best violin chart win.
Tyler Muth will also be doing a deep dive presentation on bloom filters and how they can be offloaded with smart scans. This is a topic about which there is little information, so it should be quite interesting.
Frits Hoogland will be doing a deep dive on how Oracle does multi-block I/O. This is of special interest with regard to Exadata because the direct path mechanism for doing multi-block I/O is a requirement for enabling smart scans. So understanding how it works is one of the keys to getting the most out of the platform.
Sue Lee (product manager for resource manager) will be doing a session on how to deal with mixed workloads. I’m really interested in this session as IORM and DBRM are critical for managing Exadata, particularly when it is used as a consolidation platform.
There are many other well known speakers including Martin Bach, Andy Colvin, Gwen Shapira, Mark Rittman, Tim Fox and Tanel Poder.
Here’s a link to see the complete line up of E4 speakers.
While we’re on the subject, I should mention that there will be several talks on Hadoop-related topics and the increasingly expanding role Hadoop is playing in our industry. The idea of pushing the work to the storage is not unique to Exadata; it is also the main driver behind Hadoop. So I’m extremely pleased to announce that Doug Cutting will be speaking at E4 as well.
So that’s all for the marketing related stuff on E4. I hope you can join us in Dallas.
It’s stupid o’clock in the morning and I’m waiting for my taxi to arrive. Considering how close Bulgaria is, it takes me a very long time to get there.
I am a mix of excited and nervous. This is my first conference this year, so all the usual insecurities are in full effect, from fear of flying to the constant nagging thoughts that perhaps I don’t know anything about Oracle and maybe I shouldn’t be on stage acting like I do.
I’m sure it will go OK and it will be nice to meet up with the gang again.
Thanks to everyone who attended my webinar on using hints for Oracle testing and performance tuning. As usual, it was a great event and I appreciate the comments and questions.
I'll be back in the saddle again in July so keep your eyes open for the announcement of that event. Thanks again and hope to see you then!