Search

OakieTags

Who's online

There are currently 0 users and 25 guests online.

Recent comments

New Events for Data Guard and Synchronous Redo Transport in 12c (1)

I may have said it before but I consider presenting and teaching a great way to expand one’s knowledge: first of all it requires me to really understand a subject. Secondly, when presenting, you get lots of interesting questions that can turn into blog posts like this one.

Lately I have been asked about the impact of synchronous log shipping to a physical standby database. I was sure there was an effect to be observed, depending most likely on the network latency between systems but I didn’t have any evidence I could pull out of the hat to back up my thoughts. So what better than trying! I also read that some of the events have changed in 12c, and wanted to make them visible. My environment is based on the 2 node RAC primary/2 node RAC standby configuration I wrote about in my previous posts.

Since their initial setup I upgraded the cluster to 12.1.0.2.170117 for Clusterware and RDBMS.

I am planning to share my findings in two posts: the first one (the one you are reading) will present the baseline, the second one to follow shows how performance can be impacted by poor network performance. I guess the second one will be a bit longer …

Background

Data Guard allows the administrator to define the databases in a Data Guard configuration to be in one of three possible modes. The first one is Maximum Performance, the second Maximum Availability and finally Maximum Protection. All three modes are covered in the official documentation set and MOS. Without saying it very scientifically:

  • Maximum Performance uses asynchronous log shipping and is the default. The performance of the standby database (or rather lack thereof) should not impact the performance of the primary. At a cost: it is not guaranteed that redo is actually written to the standby when the commit command returns.
  • Maximum Protection allows you to enforce a Zero Data Loss (ZDL) strategy but also at a cost: The primary will shut down if it can’t transmit redo information to the standby. That is how you prevent data loss!
  • Maximum Availability is the middle ground and I’ll focus on that protection mode here.

Make sure to read the documentation and understand the implications of each protection mode when assessing the most suitable mode for your environment!

Background: New Events

Before going into details I should probably point out that there are some new Data Guard events. Hurray! More events and more meaningful names, great. Just one caveat-have a look at the output from my 12c database:

select name, wait_class from v$event_name
where name in (
'ARCH Remote Write','ASYNC Remote Write','Redo Transport Attach',
'Redo Transport Close','Redo Transport Detach','Redo Transport Open',
'Redo Transport Ping','Redo Transport Slave Shutdown','Redo Transport Slave Startup',
'Redo Writer Remote Sync Complete', 'Redo Writer Remote Sync Notify',
'Remote SYNC Ping','SYNC Remote Write');

NAME                                               WAIT_CLASS
-------------------------------------------------- ----------------------------------------
ASYNC Remote Write                                 Other
SYNC Remote Write                                  Other
ARCH Remote Write                                  Other
Redo Transport Attach                              Other
Redo Transport Detach                              Other
Redo Transport Open                                Other
Redo Transport Close                               Other
Redo Transport Ping                                Other
Remote SYNC Ping                                   Other
Redo Transport Slave Startup                       Other
Redo Transport Slave Shutdown                      Other
Redo Writer Remote Sync Notify                     Other
Redo Writer Remote Sync Complete                   Other

13 rows selected.

Compare this to the pre 12c events (taken from an 11.2.0.3 database since they are mostly gone in 12c)

select name, wait_class from v$event_name
where name in (
'ARCH wait on ATTACH','ARCH wait on SENDREQ','ARCH wait on DETACH',
'LNS wait on ATTACH', 'LNS wait on SENDREQ','LNS wait on DETACH',
'LGWR wait on LNS','LNS wait on LGWR','LGWR-LNS wait on channel');

NAME                                               WAIT_CLASS
-------------------------------------------------- --------------------
ARCH wait on ATTACH                                Network
LNS wait on ATTACH                                 Network
LNS wait on SENDREQ                                Network
LNS wait on DETACH                                 Network
ARCH wait on SENDREQ                               Network
ARCH wait on DETACH                                Network
LNS wait on LGWR                                   Network
LGWR wait on LNS                                   Network
LGWR-LNS wait on channel                           Other

9 rows selected.

For the sake of completeness here are those events that remained in 12c:

select name, wait_class from v$event_name
where name in (
'ARCH wait on ATTACH','ARCH wait on SENDREQ','ARCH wait on DETACH',
'LNS wait on ATTACH', 'LNS wait on SENDREQ','LNS wait on DETACH',
'LGWR wait on LNS','LNS wait on LGWR','LGWR-LNS wait on channel');

NAME                           WAIT_CLASS
------------------------------ ------------------------------
LNS wait on LGWR               Network
LGWR wait on LNS               Network
LGWR-LNS wait on channel       Other

Bugger. The new events are all in the “Other” wait class. My initial attempts at filtering information from v$event_histogram based on events with the “Network” class had little success. I guess friends of the OEM performance pages might also find that interesting. If you wonder from where I pulled these events – I used a white paper named “Best Practices for Synchronous Redo Transport” as reference.

Testing: The Environment

As you read in the introduction, this environment is based on a two node RAC primary -> two node RAC standby configuration:

DGMGRL> show configuration verbose

Configuration - ractest

  Protection Mode: MaxPerformance
  Members:
  NCDBA - Primary database
    NCDBB - Physical standby database

  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS
DGMGRL> show database 'NCDBA';

Database - NCDBA

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    NCDBA1
    NCDBA2

Database Status:
SUCCESS

DGMGRL> show database 'NCDBB'

Database - NCDBB

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 177.00 KByte/s
  Real Time Query:    OFF
  Instance(s):
    NCDBB1
    NCDBB2 (apply instance)

Database Status:
SUCCESS

DGMGRL> show database 'NCDBA' LogXptMode
  LogXptMode = 'async'
DGMGRL> show database 'NCDBB' LogXptMode
  LogXptMode = 'async'

The configuration is currently set to “Maximum Performance”, databases ship redo asynchronously.

All systems are patched to the January 2017 Proactive Bundle Patch:

[oracle@rac12pri1 ~]$ $ORACLE_HOME/OPatch/opatch lspatches
24917972;Database PSU 12.1.0.2.170117, Oracle JavaVM Component (JAN2017)
24732082;Database Patch Set Update : 12.1.0.2.170117 (24732082)
24828633;OCW Patch Set Update : 12.1.0.2.170117 (24828633)

OPatch succeeded.

Testing part I: The Baseline

I am using Swingbench’s Order Entry benchmark to generate a little bit of load. For the avoidance of doubt: I am not interested in a comparison between asynchronous and synchronous log shipping. I would like to see the effect caused by the difference in latency in the form of hugely increased waits. In my first test I am running with the defaults of my VM (the NIC uses virtio drivers), no change to SDU or send/receive buffers.

$ date
Fri Feb 17 13:58:58 UTC 2017
$ ./charbench -cs //rac12pri-scan/swingbench_both \
> -uc 20 -r results_20_maxperf.xml -rt 00:10
Author  :        Dominic Giles
Version :        2.5.0.971

Results will be written to results_20_maxperf.xml.
Hit Return to Terminate Run...

Time            Users   TPM     TPS

1:59:55 PM      20      3852    255

On the apply side this translates to the following figures (the average apply rate is misleading and should be ignored; it captures all apply since MRP0 was started)

DGMGRL> show database 'NCDBB'

Database - NCDBB

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 713.00 KByte/s
  Real Time Query:    OFF
  Instance(s):
    NCDBB1
    NCDBB2 (apply instance)

Database Status:
SUCCESS

There is no transport lag, and no apply lag either. When checking for waits on background processes (using Tanel Poder’s ashtop – which requires you to have the Diagnostic Pack license as it access v$active_session_history) I noticed the following:

SQL> @ashtop program,event,wait_class,session_state session_type='BACKGROUND' "to_date('17.02.2017 09:02','dd.mm.yyyy hh24:mi')" "to_date('17.02.2017 09:07','dd.mm.yyyy hh24:mi')"

    Total
  Seconds     AAS %This   PROGRAM                                          EVENT                                    WAIT_CLASS      SESSION FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------------------------------------------ ---------------------------------------- --------------- ------- ------------------- ------------------- -----------------
       53      .2   29% | oracle@rac12pri2 (LGWR)                          log file parallel write                  System I/O      WAITING 2017-02-17 09:03:42 2017-02-17 09:06:53                 1
       23      .1   13% | oracle@rac12pri2 (LG00)                          log file parallel write                  System I/O      WAITING 2017-02-17 09:02:03 2017-02-17 09:03:26                 1
       19      .1   10% | oracle@rac12pri2 (DBW0)                                                                                   ON CPU  2017-02-17 09:02:07 2017-02-17 09:06:39                 1
       12      .0    7% | oracle@rac12pri2 (LGWR)                          target log write size                    Other           WAITING 2017-02-17 09:02:03 2017-02-17 09:03:27                 1
        9      .0    5% | oracle@rac12pri1 (LMS1)                                                                                   ON CPU  2017-02-17 09:04:37 2017-02-17 09:06:58                 1
        8      .0    4% | oracle@rac12pri2 (TT00)                                                                                   ON CPU  2017-02-17 09:02:15 2017-02-17 09:06:56                 1
        6      .0    3% | oracle@rac12pri1 (LG00)                          log file parallel write                  System I/O      WAITING 2017-02-17 09:03:08 2017-02-17 09:06:44                 1
        6      .0    3% | oracle@rac12pri2 (LGWR)                                                                                   ON CPU  2017-02-17 09:04:05 2017-02-17 09:06:41                 1
        5      .0    3% | oracle@rac12pri2 (LG00)                                                                                   ON CPU  2017-02-17 09:02:02 2017-02-17 09:03:27                 1
        5      .0    3% | oracle@rac12pri2 (LMS0)                                                                                   ON CPU  2017-02-17 09:02:07 2017-02-17 09:06:14                 1
        5      .0    3% | oracle@rac12pri2 (LMS1)                                                                                   ON CPU  2017-02-17 09:02:24 2017-02-17 09:06:28                 1
        4      .0    2% | oracle@rac12pri1 (LMS0)                                                                                   ON CPU  2017-02-17 09:04:21 2017-02-17 09:06:05                 1
        3      .0    2% | oracle@rac12pri1 (LGWR)                          LGWR all worker groups                   Other           WAITING 2017-02-17 09:03:08 2017-02-17 09:06:13                 1
        2      .0    1% | oracle@rac12pri1 (DBW0)                          db file parallel write                   System I/O      WAITING 2017-02-17 09:03:08 2017-02-17 09:03:42                 1
        2      .0    1% | oracle@rac12pri2 (CKPT)                                                                                   ON CPU  2017-02-17 09:05:10 2017-02-17 09:05:13                 1
        2      .0    1% | oracle@rac12pri2 (TT00)                          log file sequential read                 System I/O      WAITING 2017-02-17 09:03:17 2017-02-17 09:03:46                 1
        1      .0    1% | oracle@rac12pri1 (CJQ0)                                                                                   ON CPU  2017-02-17 09:04:44 2017-02-17 09:04:44                 1
        1      .0    1% | oracle@rac12pri1 (DBW0)                                                                                   ON CPU  2017-02-17 09:03:05 2017-02-17 09:03:05                 1
        1      .0    1% | oracle@rac12pri1 (LG00)                                                                                   ON CPU  2017-02-17 09:02:02 2017-02-17 09:02:02                 1
        1      .0    1% | oracle@rac12pri1 (LGWR)                          target log write size                    Other           WAITING 2017-02-17 09:05:47 2017-02-17 09:05:47                 1

The time difference between charbench and the database can be explained by time zones: my load generator is set to use UTC whereas the database is on EST; the result is a 6 hour time difference. This is a minor oversight on my part and has nothing to do with the location of the machines. In fact, they are all virtualised.

Summary Part I

In this part I tried to lay the foundation for the next one where I’m using my lab environment to simulate an artificially high network latency. I have to admit again that my setup is somewhat over-simplified-using Data Guard with asynchronous log shipping and Maximum Performance mode is not something I’d like to support for anything more important than my toy databases in the lab.

References

http://www.oracle.com/technetwork/database/availability/sync-2437177.pdf