Oakies Blog Aggregator

opt_estimate 5

If you’ve been wondering why I resurrected my drafts on the opt_estimate() hint, a few weeks ago I received an email containing an example of a query where a couple of opt_estimate() hints were simply not working. The critical feature of the example was that the basic structure of the query was of a type that I had not previously examined. That’s actually a common type of problem when trying to investigate any Oracle feature from cold – you can spend days thinking about all the possible scenarios you should model, then the first time you need to apply your knowledge to a production system the requirement falls outside every model you’ve examined.

Before you go any further reading this note, though, I should warn you that it ends in frustration because I didn’t find a solution to the problem I wanted to fix – possibly because there just isn’t a solution, possibly because I didn’t look hard enough.

So here’s a simplified version of the problem – it involves pushing a predicate into a union all view. First some data and a baseline query:

rem
rem     Script:         opt_estimate_3a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2019
rem

create table t1
as
select
        rownum                          id,
        100 * trunc(rownum/100)-1       id2,
        mod(rownum,1e3)                 n1,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        dual
connect by
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

create table t2a pctfree 75 as select * from t1;
create table t2b pctfree 75 as select * from t1;

create index t2ai on t2a(id);
create index t2bi on t2b(id);

explain plan for
select
        t1.v1,
        t2.flag,
        t2.v1
from
        t1,
        (select 'a' flag, t2a.* from t2a
         union all
         select 'b', t2b.* from t2b
        )       t2
where
        t2.id = t1.n1
and     t1.id = 99
/

select * from table(dbms_xplan.display(null,null,'outline alias'))
/


There is one row with t1.id = 99, and I would like the optimizer to use an indexed access path to select the one matching row from each of the two tables in the union all view. The smart execution plan would be a nested loop using a “pushed join predicate” – and that’s exactly what we get by default with this data set:


-----------------------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     2 |    96 |    30   (4)| 00:00:01 |
|   1 |  NESTED LOOPS                          |      |     2 |    96 |    30   (4)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL                    | T1   |     1 |    19 |    26   (4)| 00:00:01 |
|   3 |   VIEW                                 |      |     1 |    29 |     4   (0)| 00:00:01 |
|   4 |    UNION ALL PUSHED PREDICATE          |      |       |       |            |          |
|   5 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2A  |     1 |    15 |     2   (0)| 00:00:01 |
|*  6 |      INDEX RANGE SCAN                  | T2AI |     1 |       |     1   (0)| 00:00:01 |
|   7 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2B  |     1 |    15 |     2   (0)| 00:00:01 |
|*  8 |      INDEX RANGE SCAN                  | T2BI |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SEL$1        / T1@SEL$1
   3 - SET$5715CE2E / T2@SEL$1
   4 - SET$5715CE2E
   5 - SEL$639F1A6F / T2A@SEL$2
   6 - SEL$639F1A6F / T2A@SEL$2
   7 - SEL$B01C6807 / T2B@SEL$3
   8 - SEL$B01C6807 / T2B@SEL$3

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      BATCH_TABLE_ACCESS_BY_ROWID(@"SEL$639F1A6F" "T2A"@"SEL$2")
      INDEX_RS_ASC(@"SEL$639F1A6F" "T2A"@"SEL$2" ("T2A"."ID"))
      BATCH_TABLE_ACCESS_BY_ROWID(@"SEL$B01C6807" "T2B"@"SEL$3")
      INDEX_RS_ASC(@"SEL$B01C6807" "T2B"@"SEL$3" ("T2B"."ID"))
      USE_NL(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      NO_ACCESS(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SET$1")
      OUTLINE(@"SEL$3")
      OUTLINE(@"SEL$2")
      OUTLINE_LEAF(@"SEL$1")
      PUSH_PRED(@"SEL$1" "T2"@"SEL$1" 1)
      OUTLINE_LEAF(@"SET$5715CE2E")
      OUTLINE_LEAF(@"SEL$B01C6807")
      OUTLINE_LEAF(@"SEL$639F1A6F")
      ALL_ROWS
      OPT_PARAM('_nlj_batching_enabled' 0)
      DB_VERSION('12.2.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."ID"=99)
   6 - access("T2A"."ID"="T1"."N1")
   8 - access("T2B"."ID"="T1"."N1")

So that worked well – operation 2 predicts one row for the tablescan of t1, with a nested loop join and union all pushed predicate where an index range scan of t2ai and t2bi gives us one row from each table. The “Predicate Information” tells us that the t1.n1 join predicate has been pushed inside the view to both subqueries, so we see “t2a.id = t1.n1” and “t2b.id = t1.n1”.

So what if I want to tell Oracle that it will actually find 5 rows in the t2a range scan and table access, and 7 rows in the t2b range scan and table access (perhaps, in a more complex view, that would persuade Oracle to use two different indexes to get into the view and change the join order and access method for the next few tables it accessed). Since I’ve just recently written about the nlj_index_scan option for opt_estimate() you might think that this is the one we need to use – perhaps something like:


opt_estimate(@sel$639f1a6f nlj_index_scan, t2a@sel$2 (t1), t2ai, scale_rows=5)
opt_estimate(@sel$b01c6807 nlj_index_scan, t2b@sel$3 (t1), t2bi, scale_rows=7)

You’ll notice I’ve been very careful to find the fully qualified aliases for t2a and t2b by looking at the “Query Block Name / Object Alias” section of the plan (if the view appeared as a result of Oracle using Concatenation or OR-Expansion you would find that you got two query block names that looked similar but had suffixes of “_1” and “_2”). But it wasn’t worth the effort: it didn’t work. Fiddling around with all the possible variations I could think of didn’t help (maybe I should have used set$5715ce2e as the query block target for both the hints – no; what if I …)

Of course, if we look at the “Outline Data” we notice that the use_nl() hint in the outline says: “USE_NL(@SEL$1 T2@SEL$1)”, so we don’t have a nested loop into t2a and t2b, we have a nested loop into the view t2. So I decided to forget the nested loop idea and just go for the indexes (and the tables, when I got to them) with the following hints (you’ll notice that during the course of my experiments I added my own query block names to the initial query blocks – so the generated query block names have changed):



explain plan for
select
        /*+
                qb_name(main)
                opt_estimate(@sel$f2bf1101, index_scan, t2a@subq_a, t2ai, scale_rows=5)
                opt_estimate(@sel$f2bf1101, table,      t2a@subq_a,       scale_rows=5)
                opt_estimate(@sel$f4e7a233, index_scan, t2b@subq_b, t2bi, scale_rows=7)
                opt_estimate(@sel$f4e7a233, table,      t2b@subq_b,       scale_rows=7)
        */
        t1.v1,
        t2.flag,
        t2.v1
from
        t1,
        (select /*+ qb_name(subq_a) */ 'a' flag, t2a.* from t2a
         union all
         select /*+ qb_name(subq_b) */ 'b', t2b.* from t2b
        )       t2
where
        t2.id = t1.n1
and     t1.id = 99
;

select * from table(dbms_xplan.display(null,null,'outline alias'));


-----------------------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     2 |    96 |    30   (4)| 00:00:01 |
|   1 |  NESTED LOOPS                          |      |     2 |    96 |    30   (4)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL                    | T1   |     1 |    19 |    26   (4)| 00:00:01 |
|   3 |   VIEW                                 |      |     1 |    29 |     4   (0)| 00:00:01 |
|   4 |    UNION ALL PUSHED PREDICATE          |      |       |       |            |          |
|   5 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2A  |     5 |    75 |     2   (0)| 00:00:01 |
|*  6 |      INDEX RANGE SCAN                  | T2AI |     5 |       |     1   (0)| 00:00:01 |
|   7 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2B  |     7 |   105 |     2   (0)| 00:00:01 |
|*  8 |      INDEX RANGE SCAN                  | T2BI |     7 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------


Excellent – we get the cardinalities we want to see for the tables – except the view operator doesn’t hold the sum of the table cardinalities, and the join doesn’t multiply up the estimates either. I couldn’t find a way of getting the view to show 12 rows (not even with a guessed – but presumably unimplemented – opt_estimate(view …) hint!). However, during the course of my experiments I tried the hint: “opt_estimate(@main, table, t2@main, scale_rows=15)”. This didn’t have any visible effect in the plan, but while searching through the 10053 trace file I found the following lines:

Table Stats::
  Table: from$_subquery$_002  Alias: T2  (NOT ANALYZED)
  #Rows: 20000  SSZ: 0  LGR: 0  #Blks:  37  AvgRowLen:  15.00  NEB: 0  ChainCnt:  0.00  ScanRate:  0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 9

Access path analysis for from$_subquery$_002
    >> Single Tab Card adjusted from 20000.000000 to 300000.000000 due to opt_estimate hint

Access path analysis for from$_subquery$_002
    >> Single Tab Card adjusted from 12.000000 to 180.000000 due to opt_estimate hint

So at some point in the code path the optimizer is aware that 5 + 7 = 12, and that 12 * 15 = 180. But this doesn’t show up in the final execution plan. You might notice, by the way, that the scale_rows=15 has been applied NOT ONLY to the place where I was aiming – it’s also been applied to scale up the 20,000 rows that are estimated to be in the union all to 300,000 as the estimate for a tablescan of the two tables.

Possibly if I spent more time working through the 10053 trace file (which, as I’ve said before, I try to avoid doing) I might have found exactly which code path Oracle followed to get to the plan it produced and managed to tweak some hints to get the numbers I wanted to see. Possibly the optimizer was already following the code path that actually produced the numbers I wanted, then “forgot” to use them. One day, perhaps, I’ll take another look at the problem – but since I wasn’t trying to solve a problem for a client (and given that there was an alternative workaround) I closed the 10053 trace file and put the model aside for a rainy day.

Footnote

One thought did cross my mind as a way of finding out if there was a real solution – and I offer this for anyone who wants to play: create a second data set that genuinely produces the 5 and 7 I want to see (and check that the view reports the sum of the two components); then run the original query against the original data so that you’ve got the execution plan in memory, and overwrite the original data with the new data set (without changing the statistics on the original). Then use the SQL Tuning Advisor to see if it produces a SQL profile for the captured SQL_ID that reproduces the correct plan for the second data set, and check what opt_estimate() hints it uses.  (Warning – this may turn into a frustrating waste of time.)
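For anyone who does want to try it, here’s a minimal sketch (not a tested recipe) of driving the SQL Tuning Advisor from SQL*Plus – the sql_id substitution variable and the task name “opt_est_task” are placeholders of my own, and the time limit is arbitrary:

declare
        l_task  varchar2(64);
begin
        -- create and run a tuning task against the captured sql_id (placeholder value)
        l_task := dbms_sqltune.create_tuning_task(
                        sql_id     => '&m_sql_id',
                        task_name  => 'opt_est_task',
                        time_limit => 60
                  );
        dbms_sqltune.execute_tuning_task(task_name => l_task);
end;
/

set long 100000
-- if the advisor recommends a SQL profile the report will include the
-- accept_sql_profile() call to run; the hints behind that profile (which may
-- include opt_estimate() hints) are what we're really interested in.
select dbms_sqltune.report_tuning_task('opt_est_task') from dual;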

 

PFCLScan - Version 3.0

We are very excited to announce that we are currently working on version 3.0 of PFCLScan, our flagship database security scanner for the Oracle database. It will be ready for sale in September and this development is going really....[Read More]

Posted by Pete On 11/07/19 At 03:33 PM

Oracle and Microsoft’s Cross-Cloud Partnership

A couple of weeks back, Oracle and Microsoft announced their cross-cloud partnership.  This was wonderful news to me, as I’ve been working on numerous Oracle projects at Microsoft with Azure.

The Gist

Knowing that there is now a partnership between the two clouds, and that there’s also a large amount of documentation about working between them, is very helpful compared with working from just my own knowledge.  Just as anyone appreciates a second set of eyes, I now have two companies’ worth!

If you missed the announcement and are curious what it’s about: Oracle has forged a partnership with Microsoft to provide cross-cloud support on Azure for the non-database tier of many of their products.  These products include:

There are multiple supported authentication methods for EBiz to Azure.  The documentation linked above is for Oracle Access Manager, but it offers some insight into how in-depth Oracle and Microsoft have gone to make this a successful venture.

Oracle on Azure

Currently, most of my Oracle customers aren’t looking to leave Oracle; they just want to use the Azure cloud to house the database instances.  This isn’t a difficult option, as I can build out what they need in Azure (bare metal or VMs), almost anything they require.  The list is long (and not limited to):

  • Oracle Databases 12c-18c, with hopefully 19c to soon follow
  • Oracle Dataguard
  • Oracle Enterprise Manager
  • Oracle Golden Gate
  • Oracle Essbase
  • Oracle Hyperion
  • Oracle Data Integrator
  • APEX

I haven’t run into any databases that were too large for Azure to handle, and for the one customer that was using over 1.7TB of memory, a quick AWR report gave me an opportunity to show them how to eliminate a huge percentage of their memory (and CPU) needs.

RAC

Yes, you can build Oracle RAC on Azure, but it requires a third-party tool, such as FlashGrid, to provide the software clustering, and it’s not certified for support by Oracle.  Don’t fret, though, because before ever going down this path you should find out the business reason the customer is using RAC to begin with.  There is significant overhead to the software clustering, and often the goal isn’t met by the product.

  1. High Availability – if the nodes all reside in one datacenter, does this meet HA?
  2. Failover #1 – many times the apps are not able to reconnect when there is a failover.
  3. Failover #2 – due to the extra resource usage by each node/cluster, a failover can cause failure and eviction of more nodes.
  4. The Cloud – Azure possesses a number of HA features already built in, so Oracle Data Guard will more than suffice instead of RAC.

So this is part of my new world at Microsoft and I’ve foretold this for over a decade.  No matter what platform you choose, there is always some outlier that is mission critical that you need to manage.  With the introduction of the cloud, which creates easier access to other platforms and technology, our world just keeps getting smaller, only faster.  I’m alright with this and no, I won’t get bored this way.

 




Copyright © DBAKevlar [Oracle and Microsoft's Cross-Cloud Partnership], All Rights Reserved. 2019.

Decision Support Benchmark TPC-H won by #Exasol

Oops, we did it again

Assumptions

Over the last few days I’ve been tweeting little extracts from Practical Oracle 8i, and one of the tweets contained the following quote:

This led to the question:

HIGH_VALUE (and other LONG columns) revisited

Just a quick post to ensure people don’t get caught out by a recent “trick” I saw on an AskTOM question for extracting the HIGH_VALUE out of its LONG storage in the dictionary into a more appropriate datatype. A reviewer (I’m sure in good faith) posted the following mechanism to extract the HIGH_VALUE:


SQL> create table t1 ( x timestamp )
  2  partition by range ( x )
  3  ( partition p1 values less than ( timestamp '2019-01-01 00:00:00' ),
  4    partition p2 values less than ( timestamp '2019-02-01 00:00:00' )
  5  );

Table created.

SQL>
SQL> set serveroutput on
SQL> declare
  2    v long;
  3    d date;
  4  begin
  5    for i in ( select high_value from user_tab_partitions where table_name = 'T1' )
  6    loop
  7      execute immediate 'select cast(:hv as date) from dual' into d using  i.high_value;
  8      dbms_output.put_line(d);
  9    end loop;
 10  end;
 11  /
01-JAN-19
01-FEB-19

PL/SQL procedure successfully completed.

You can see that we are casting the HIGH_VALUE as a date and, voila, out pop the values in a nice DATE datatype. That all seems well and good, but it only works for TIMESTAMP partitions and not for the general case. Let’s replace our table with a DATE-based partitioning scheme:


SQL> drop table t1 purge;

Table dropped.

SQL> create table t1 ( x date )
  2  partition by range ( x )
  3  ( partition p1 values less than ( date '2019-01-01' ),
  4    partition p2 values less than ( date '2019-02-01' )
  5  );

Table created.

SQL>
SQL> set serveroutput on
SQL> declare
  2    d date;
  3  begin
  4    for i in ( select high_value from user_tab_partitions where table_name = 'T1' )
  5    loop
  6      execute immediate 'select cast(:hv as date) from dual' into d using i.high_value;
  7      dbms_output.put_line(d);
  8    end loop;
  9  end;
 10  /
declare
*
ERROR at line 1:
ORA-01858: a non-numeric character was found where a numeric was expected
ORA-06512: at line 6
ORA-06512: at line 6

There are plenty of methods to overcome the limitations of LONG. Here’s the one I typically use – just a small variant on the code above:


SQL> set serveroutput on
SQL> declare
  2    d date;
  3  begin
  4    for i in ( select high_value from user_tab_partitions where table_name = 'T1' )
  5    loop
  6      execute immediate 'select '||i.high_value||' from dual' into d;
  7      dbms_output.put_line(d);
  8    end loop;
  9  end;
 10  /
01-JAN-19
01-FEB-19

PL/SQL procedure successfully completed.

which I’ve also covered in video form here.

But just be careful using the CAST trick. It might cause you grief depending on the partitioning key you are using.
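If you want something you can call from plain SQL, you can wrap the same idea in a small function. This is only a sketch (the function name is made up for this post), and it assumes the HIGH_VALUE is a literal expression that evaluates to a date or timestamp – so it works for DATE and TIMESTAMP range partitions but will fail on a MAXVALUE partition:

create or replace function hv_as_date( p_table_name varchar2, p_partition_name varchar2 )
return date
as
  l_hv  long;
  l_d   date;
begin
  -- HIGH_VALUE is a LONG, but it fits comfortably into a PL/SQL variable
  select high_value
  into   l_hv
  from   user_tab_partitions
  where  table_name     = p_table_name
  and    partition_name = p_partition_name;

  -- evaluate the stored literal (date '...', timestamp '...', to_date(...), etc.)
  execute immediate 'select '||l_hv||' from dual' into l_d;
  return l_d;
end;
/

select partition_name, hv_as_date('T1', partition_name) high_value_date
from   user_tab_partitions
where  table_name = 'T1';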

And if you’re wondering why the image associated with this post is that of a horse, it’s the old joke: “A horse walks into a bar, and the barman says: Why the LONG face?”

Sampling pg_stat_statements based on the active sessions and their associated queryid

Introduction

Now that we have the ability to sample and record the active sessions and their associated queryid with the pg_active_session_history view (see this blog post), it would be interesting to have insights about the queries statistics at the time the sessions were active.

PostgreSQL provides the queries statistics with the pg_stat_statements view. We could query the pg_active_session_history and the pg_stat_statements views and join them on the queryid field, but the queries statistics would be:

  • cumulative
  • the ones at the time we would launch this query

So it would not provide the queries statistics at the time the sessions were active.
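For illustration, that naive join might look like the sketch below – it returns the counters as they are at the moment you run it, not as they were when the sessions were sampled:

-- Naive approach: pg_stat_statements is cumulative, so these numbers reflect
-- "now", not the moment the sessions were captured.
select ash.ash_time,
       ash.queryid,
       pgss.calls,
       pgss.total_time,
       pgss.rows
from   pg_active_session_history ash
join   pg_stat_statements        pgss on pgss.queryid = ash.queryid
order by ash.ash_time desc;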

What’s new?

To get more granular queries statistics, the pgsentinel extension has evolved so that it now samples the pg_stat_statements:

  • at the same time it is sampling the active sessions
  • only for the queryids that were associated with an active session (if any) during the sampling

The samples are recorded into a new pg_stat_statements_history view.

This view looks like:

                    View "public.pg_stat_statements_history"
       Column        |           Type           | Collation | Nullable | Default 
---------------------+--------------------------+-----------+----------+---------
 ash_time            | timestamp with time zone |           |          | 
 userid              | oid                      |           |          | 
 dbid                | oid                      |           |          | 
 queryid             | bigint                   |           |          | 
 calls               | bigint                   |           |          | 
 total_time          | double precision         |           |          | 
 rows                | bigint                   |           |          | 
 shared_blks_hit     | bigint                   |           |          | 
 shared_blks_read    | bigint                   |           |          | 
 shared_blks_dirtied | bigint                   |           |          | 
 shared_blks_written | bigint                   |           |          | 
 local_blks_hit      | bigint                   |           |          | 
 local_blks_read     | bigint                   |           |          | 
 local_blks_dirtied  | bigint                   |           |          | 
 local_blks_written  | bigint                   |           |          | 
 temp_blks_read      | bigint                   |           |          | 
 temp_blks_written   | bigint                   |           |          | 
 blk_read_time       | double precision         |           |          | 
 blk_write_time      | double precision         |           |          | 

Remarks:

  • The field descriptions are the same as for pg_stat_statements (except for ash_time, which is the time of the active session history sampling)
  • As for pg_active_session_history, the pg_stat_statements_history view is implemented as an in-memory ring buffer where the number of samples to be kept is configurable (thanks to the pgsentinel_pgssh.max_entries parameter)
  • The data collected are still cumulative metrics, but you can make use of the window functions in PostgreSQL to compute the delta between samples (and so get accurate statistics for the queries between two samples)

For example, we could get the rows per second, calls per second and rows per call, per queryid and ash_time, this way:

select ash_time,queryid,delta_rows/seconds "rows_per_second",delta_calls/seconds "calls_per_second",delta_rows/delta_calls "rows_per_call"
from(
SELECT ash_time,queryid,
EXTRACT(EPOCH FROM ash_time::timestamp) - lag (EXTRACT(EPOCH FROM ash_time::timestamp))
OVER (
        PARTITION BY pgssh.queryid
        ORDER BY ash_time
        ASC) as "seconds",
rows-lag(rows)
        OVER (
        PARTITION BY pgssh.queryid
        ORDER BY ash_time
        ASC) as "delta_rows",
calls-lag(calls)
        OVER (
        PARTITION BY pgssh.queryid
        ORDER BY ash_time
        ASC) as "delta_calls"
    FROM pg_stat_statements_history pgssh) as delta
where delta_calls > 0 and seconds > 0
order by ash_time desc;

           ash_time            |      queryid       | rows_per_second  | calls_per_second | rows_per_call
-------------------------------+--------------------+------------------+------------------+----------------
 2019-07-06 11:09:48.629218+00 | 250416904599144140 | 10322.0031121842 | 10322.0031121842 |              1
 2019-07-06 11:09:47.627184+00 | 250416904599144140 | 10331.3930170891 | 10331.3930170891 |              1
 2019-07-06 11:09:46.625383+00 | 250416904599144140 | 10257.7574710211 | 10257.7574710211 |              1
 2019-07-06 11:09:42.620219+00 | 250416904599144140 |  10296.311364551 |  10296.311364551 |              1
 2019-07-06 11:09:41.618404+00 | 250416904599144140 | 10271.6737455877 | 10271.6737455877 |              1
 2019-07-06 11:09:36.612209+00 | 250416904599144140 | 10291.1563299622 | 10291.1563299622 |              1
 2019-07-06 11:09:35.610378+00 | 250416904599144140 | 10308.9798914136 | 10308.9798914136 |              1
 2019-07-06 11:09:33.607367+00 | 250416904599144140 |  10251.230955397 |  10251.230955397 |              1
 2019-07-06 11:09:31.604193+00 | 250416904599144140 | 10284.3551339058 | 10284.3551339058 |              1
 2019-07-06 11:09:30.60238+00  | 250416904599144140 | 10277.4631222064 | 10277.4631222064 |              1
 2019-07-06 11:09:24.595353+00 | 250416904599144140 | 10283.3919912856 | 10283.3919912856 |              1
 2019-07-06 11:09:22.59222+00  | 250416904599144140 | 10271.3534800552 | 10271.3534800552 |              1
 2019-07-06 11:09:21.59021+00  | 250416904599144140 | 10300.1104655978 | 10300.1104655978 |              1
 2019-07-06 11:09:20.588376+00 | 250416904599144140 | 10343.9790974522 | 10343.9790974522 |              1
 2019-07-06 11:09:16.583341+00 | 250416904599144140 | 10276.5525289304 | 10276.5525289304 |              1

  • it’s a good practice to normalize the statistics per second (as the sampling interval might change)
  • With that level of information we can understand the database activity in the past (thanks to the active sessions sampling) and get statistics per query at the time they were active
  • pgsentinel is available in this github repository

Conclusion

The pgsentinel extension now provides:

  • Active session history (through the pg_active_session_history view)
  • Queries statistics history (through the pg_stat_statements_history view), recorded at the exact same time as their associated active sessions (combined in the sketch below)
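As a final illustration of how the two views fit together, here’s a sketch. It assumes that both views record the same ash_time value for a given sample (as the description above implies) and that there is one pg_stat_statements_history row per (ash_time, queryid):

-- For each sample time and queryid: the number of active sessions captured,
-- together with the pg_stat_statements counters recorded at that same sample.
select ash.ash_time,
       ash.queryid,
       count(*)              as active_sessions,
       min(pgssh.calls)      as calls,
       min(pgssh.total_time) as total_time
from   pg_active_session_history  ash
join   pg_stat_statements_history pgssh
  on   pgssh.ash_time = ash.ash_time
 and   pgssh.queryid  = ash.queryid
group by ash.ash_time, ash.queryid
order by ash.ash_time desc;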

strace -k (build with libunwind)

strace -k (built with libunwind) prints the stack trace with the system calls.

PostgreSQL is Open Source and you may think that it is not necessary to trace the system calls as we can read the source code. But even there, strace is a really nice tool for troubleshooting a running process.

https://postgreslondon.org/speaker/dmitrii-dolgov/

Little disclaimer here: attaching strace to a running process may hang it. Do not use it in production, except when this (small) risk is acceptable in order to troubleshoot a critical problem.

At PostgresLondon 2019, attending Dmitry Dolgov’s session “PostgreSQL at low level: stay curious!”, I learned something about strace: the -k argument displays the stack trace. That’s awesome: it can display the full stack of C functions in the software at the moment the system call is made. It’s an easy way to get the context of the system call and understand the reason for it.

With Oracle, we are used to seeing the stack trace in dumps, but of course I immediately wanted to test this strace -k:

$ strace -k
strace: invalid option -- 'k'
Try 'strace -h' for more information.

OK, not available here (OEL 7.6), but man strace gives more information:

-k Print the execution stack trace of the traced processes after each system call (experimental). This option is available only if strace is built with libunwind.

OK, no problem, let’s build strace with libunwind.

This is what I did, as root, to replace my current strace with the new one:

yum install -y autoconf automake libunwind libunwind-devel
cd /var/tmp
git clone https://github.com/strace/strace.git
cd strace
./bootstrap
./configure --with-libunwind
make
make install

The strace man page is also updated:

-k Print the execution stack trace of the traced processes after each system call.

Ready to test. Here is an example, tracing the Oracle Log Writer for writes:

strace -k -e trace=desc -y -p $(pgrep -f ora_lgwr_CDB1A)

When I have a stack from the oracle executable, I format it the way we are used to seeing short stacks in Oracle dumps:

strace -k -e trace=desc -y -p $(pgrep -f ora_lgwr_CDB1A) 2>&1 | 
awk '/^ > /{gsub(/[()]/," ");sub(/+0x/,"()&");printf "<-"$3;next}{sub(/[(].*/,"()");printf "\n\n"$0}'

and then I can paste it into Frits Hoogland’s http://orafun.info/stack/ to add some annotations about the Oracle C functions:

http://orafun.info/stack/ created by Frits Hoogland with a little help from Kamil Stawiarski

Again, there is always a risk in attaching strace to a running process, and there may be good reasons why the Linux distributions do not build it with libunwind, so be careful: use it in labs and non-critical environments, or in critical ones only when a blocking issue to troubleshoot justifies the risk.

Getting rid of annoying, repetitive messages in /var/log/messages

The primary source of information regarding any change or issue on a Linux system is the /var/log/messages file. I am often annoyed when a Linux system is set up in such a way that certain messages are written to syslog with a high frequency, swamping the messages file with information that is not important. The reason for my annoyance is that this makes it very hard to actually spot important information, because you have to skip through a lot of lines before you find it, especially if you do not know for sure whether there is a message in the first place.

Please mind that this blogpost was created on a CentOS 7 server, which uses rsyslog.

There are a couple of ways to manage this. The standard syslog way of managing this is the following, which can be found in /etc/rsyslog.conf:

#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog


# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log

What this shows is that certain principal tasks have their own ‘facility’ (in this example ‘mail’, ‘cron’, ‘authpriv’, etc. as shown above), each of which is put in its own file.

But how about processes that do not have their own facility, and just write to syslog? One example of these is dhclient, which produces messages like these:

Jun 30 03:28:32 ip-172-31-12-40 dhclient[19604]: DHCPREQUEST on eth0 to 172.31.0.1 port 67 (xid=0x5dfd88c4)
Jun 30 03:28:32 ip-172-31-12-40 dhclient[19604]: DHCPACK from 172.31.0.1 (xid=0x5dfd88c4)
Jun 30 03:28:35 ip-172-31-12-40 dhclient[19604]: bound to 172.31.12.40 -- renewal in 1726 seconds.
Jun 30 03:57:21 ip-172-31-12-40 dhclient[19604]: DHCPREQUEST on eth0 to 172.31.0.1 port 67 (xid=0x5dfd88c4)
Jun 30 03:57:21 ip-172-31-12-40 dhclient[19604]: DHCPACK from 172.31.0.1 (xid=0x5dfd88c4)
Jun 30 03:57:23 ip-172-31-12-40 dhclient[19604]: bound to 172.31.12.40 -- renewal in 1798 seconds.

It does this every 5 minutes, which is truly annoying…

Luckily, there is a solution: rsyslog has the option to filter on more properties than just the logging facility a process is using. This is done using a configuration file in /etc/rsyslog.d/.

On my server, the majority of the messages are from the dhclient and systemd daemons, and both sets of messages seem to be informational. In order not to miss anything, I still want that information to be logged, but not in the /var/log/messages file.

This can actually be accomplished quite simply with the following two configuration files in /etc/rsyslog.d/:

dhclient.conf:

if $programname == 'dhclient' then /var/log/dhclient.log
&stop

systemd.conf:

if $programname == 'systemd' then /var/log/systemd.log
&stop

Once you have created these files, you need to make rsyslogd read the new configuration. I thought killall -HUP rsyslogd would accomplish this, but apart from a message in the /var/log/messages file saying it got the HUP signal, it doesn’t re-read the configuration.

However, executing:

systemctl stop rsyslog.service
systemctl start rsyslog.service

does make rsyslog read the new configuration, and then both dhclient and systemd log to their own files and no longer write to the messages file!

There is one last thing that needs to be done: make sure the newly defined logfiles are rotated just like all the other files in /var/log. Otherwise these files will grow endlessly, eventually occupying all the space in the filesystem that /var/log is part of.

This too is really easy: the newly defined logfiles can be added to the list of syslog files for logrotate, which is defined in /etc/logrotate.d/syslog:

/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
/var/log/dhclient.log
/var/log/systemd.log
{
    missingok
    sharedscripts
    postrotate
	/bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}

As you can see, I simply added /var/log/dhclient.log and /var/log/systemd.log to the list.

Please mind that the filter for dhclient and systemd is on the executable name, so even if the severity of the log message is high, a message from these daemons will still go to the file it is configured to log to.