Recently I was looking in the trace directory in the diag dest of my (12.2.0.1) instance, and found some trace files which were different from the others:
$ ls | grep dia0 test_dia0_9711_base_1.trc test_dia0_9711_base_1.trm test_dia0_9711_lws_1.trc test_dia0_9711_lws_1.trm test_dia0_9711.trc test_dia0_9711.trm
The dia0 ‘test_dia0_9711.trc’ file is fully expected. But what are these ‘..lws_1.trc’ and ‘..base_1.trc’ files? And ‘base’ is something that I understand, but what would ‘lws’ mean? Lunatics Without Skateboards?
First, let’s look into the normally named trace file of the dia0 process:
$ cat test_dia0_9711.trc HM: Early Warning and/or Long Waiting Session (LWS) is enabled. The output can be found in trace files of the form: test_dia0_9711_lws_n.trc where n is a number between 1 and 5. *** 2017-11-26T12:18:25.849759+00:00 HM: Moving the default trace file to: test_dia0_9711_base_1.trc *** 2017-11-26T16:45:33.193729+00:00 HM: Tracing moved from base trace file: test_dia0_9711_base_1.trc HM: This is a single non-RAC instance. Hang detection is enabled. HM: Hang resolution scope is OFF. HM: Cross Boundary hang detection is enabled. HM: Cross Cluster hang detection is enabled. HM: CPU load is not being monitored. HM: IO load is NOT abnormally high - 0% outliers. HM: Tracing is now being performed to this file. 2017-11-26 16:45:33.193 :kjzdshn(): Shutting down DIAG
Ah! So LWS means ‘Long Waiting Sessions’. This seems to be functionality that produces trace files for execution which do not perform optimally.
Let’s look into the ‘base’ trace file:
*** 2017-11-26T12:18:25.849773+00:00 HM: Tracing moved from default trace file: test_dia0_9711.trc HM: This is a single non-RAC instance. Hang detection is enabled. HM: Hang resolution scope is OFF. HM: Cross Boundary hang detection is enabled. HM: Cross Cluster hang detection is enabled. HM: CPU load is not being monitored. HM: IO load is NOT abnormally high - 0% outliers. *** 2017-11-26T12:18:25.849903+00:00 HM: Early Warning and/or Long Waiting Session (LWS) is enabled. The output can be found in trace files of the form: test_dia0_9711_lws_n.trc where n is a number between 1 and 5. *** 2017-11-26T12:18:30.870494+00:00 HM: Detected incompatible setting 'INSTANCE' for Hang Resolution Scope - disabling hang resolution.
I get the feeling this is originally made for RAC, but now also working for single instance. Some RAC related features seem to be turned off. It seems that the actual session information is stored in the ‘lws’ trace files.
This is a sample of what is in a lws trace file:
*** 2017-11-26T13:14:25.053171+00:00 HM: Early Warning - Session ID 58 serial# 53774 OS PID 30760 (FG) is 'not in a wait' for 33 seconds last waited on 'resmgr:cpu quantum' p1: 'location'=0x3, p2: 'consumer group id'=0x4a2f, p3: ' '=0x0 IO Total Self- Total Total Outlr Outlr Outlr Hung Rslvd Rslvd Wait WaitTm Wait WaitTm Wait Sess Hangs Hangs Count Secs Count Secs Count Wait Event ------ ------ ------ ------ ------ ------ ------ ------ ----------- 1 0 0 0 0 0 0 0 not in wait HM: Dumping Short Stack of pid[71.30760] (sid:58, ser#:53774) Short stack dump: ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-sslnxadd()+38<-pfrinstr_ADDN()+103<-pfrrun_no_tool()+60<-pfrrun()+919<-plsql_run()+756<-peicnt()+288<-kkxexe()+717<-opiexe()+22531<-kpoal8()+2679<-opiodr()+1229<-ttcpip()+1257<-opitsk()+1940<-opiino()+941<-opiodr()+1229<-opidrv()+1021<-sou2o()+145<-opimai_real()+455<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245 HM: Current SQL: declare x number := 0; begin for c in 1..1000000000 loop x := x + c; end loop; end;
I got a sample PLSQL loop running for testing load (which is really silly), and this running for 33 seconds apparently generates an ‘Early Warning’. It shows what it was doing from a wait based analysis perspective (‘not in a wait’, ‘resmgr: cpu quantum’), it shows some statistics for the hang analyser, it dumps a short stack, which is fully expected; kdsests() to __sighandler() are the routines for generating the stack dump, the actual function that was executing was sslnxadd(), which is a system specific linux add function, which is called by pfrinstr_ADDN(), which is the PLSQL add function, which is called by pfrrun_no_tool(), which is the PLSQL fast interpretation loop function, and so on. The last thing it dumps is the current SQL, which is the silly PLSQL code.
Is there anything inside the Oracle database that shows anything related to this? Yes, it seems there are:
V$HANG_INFO V$HANG_SESSION_INFO V$HANG_STATISTICS DBA_HANG_MANAGER_PARAMETERS
Doing some simple tests, I could not get V$HANG_INFO and V$HANG_SESSION_INFO populated. V$HANG_STATISTICS contains 87 statistics, and displays current data, of which most in my single instance test database are 0.
One thing that confused me at first was the abbreviation ‘HM’ for hang monitor, because the very same abbreviation is used in the diagnostic repository for ‘health monitor’.
It seems the hang manager is a quite extensive framework for detecting hangs and hang related issues; I peeked at the documented and undocumented parameters containing “hang” in them, of which there are quite an impressive amount quite likely belonging to the hang manager framework:
SQL> @parms Enter value for parameter: hang Enter value for isset: Enter value for show_hidden: Y _global_hang_analysis_interval_secs 10 TRUE FALSE FALSE _hang_allow_resolution_on_single_nonrac FALSE TRUE FALSE FALSE _hang_analysis_num_call_stacks 3 TRUE FALSE FALSE _hang_base_file_count 5 TRUE FALSE FALSE _hang_base_file_space_limit 100000000 TRUE FALSE FALSE _hang_blocked_session_percent_threshold 10 TRUE FALSE FALSE _hang_bool_spare1 TRUE TRUE FALSE FALSE NAME VALUE ISDEFAUL ISMODIFIED ISSET -------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ---------- _hang_bool_spare2 TRUE TRUE FALSE FALSE _hang_cross_boundary_hang_detection_enabled TRUE TRUE FALSE FALSE _hang_cross_cluster_hang_detection_enabled TRUE TRUE FALSE FALSE _hang_delay_resolution_for_libcache TRUE TRUE FALSE FALSE _hang_detection_enabled TRUE TRUE FALSE FALSE _hang_enable_processstate TRUE TRUE FALSE FALSE _hang_hang_analyze_output_hang_chains TRUE TRUE FALSE FALSE _hang_hiload_promoted_ignored_hang_count 2 TRUE FALSE FALSE _hang_hiprior_session_attribute_list TRUE FALSE FALSE _hang_hung_session_ewarn_percent 34 TRUE FALSE FALSE _hang_ignored_hang_count 1 TRUE FALSE FALSE NAME VALUE ISDEFAUL ISMODIFIED ISSET -------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ---------- _hang_ignored_hangs_interval 300 TRUE FALSE FALSE _hang_ignore_hngmtrc_count 1 TRUE FALSE FALSE _hang_int_spare1 0 TRUE FALSE FALSE _hang_int_spare2 0 TRUE FALSE FALSE _hang_log_io_hung_sessions_to_alert FALSE TRUE FALSE FALSE _hang_log_verified_hangs_to_alert FALSE TRUE FALSE FALSE _hang_long_wait_time_threshold 0 TRUE FALSE FALSE _hang_lws_file_count 5 TRUE FALSE FALSE _hang_lws_file_space_limit 100000000 TRUE FALSE FALSE _hang_max_session_hang_time 96 TRUE FALSE FALSE _hang_monitor_archiving_related_hang_interval 300 TRUE FALSE FALSE NAME VALUE ISDEFAUL ISMODIFIED ISSET -------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ---------- _hang_msg_checksum_enabled TRUE TRUE FALSE FALSE _hang_resolution_allow_archiving_issue_termination TRUE TRUE FALSE FALSE _hang_resolution_confidence_promotion FALSE TRUE FALSE FALSE _hang_resolution_global_hang_confidence_promotion FALSE TRUE FALSE FALSE _hang_resolution_percent_hung_sessions_threshold 300 TRUE FALSE FALSE _hang_resolution_policy HIGH TRUE FALSE FALSE _hang_resolution_promote_process_termination FALSE TRUE FALSE FALSE _hang_resolution_scope OFF TRUE FALSE FALSE _hang_running_in_lrg FALSE TRUE FALSE FALSE _hang_short_stacks_output_enabled TRUE TRUE FALSE FALSE _hang_signature_list_match_output_frequency 10 TRUE FALSE FALSE NAME VALUE ISDEFAUL ISMODIFIED ISSET -------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ---------- _hang_statistics_collection_interval 15 TRUE FALSE FALSE _hang_statistics_collection_ma_alpha 30 TRUE FALSE FALSE _hang_statistics_high_io_percentage_threshold 25 TRUE FALSE FALSE _hang_terminate_session_replay_enabled TRUE TRUE FALSE FALSE _hang_trace_interval 32 TRUE FALSE FALSE _hang_verification_interval 46 TRUE FALSE FALSE
For the sake of experimenting with the hang manager I locked a table using ‘lock table t2 in exclusive mode’ and another session doing an insert. The ‘base’ trace file showed the following information:
*** 2017-12-11T06:25:10.364037+00:00 HM: Hung Sessions (local detect) - output local chains =============================================================================== Non-intersecting chains: ------------------------------------------------------------------------------- Chain 1: ------------------------------------------------------------------------------- Oracle session identified by: { instance: 1 (test.test) os id: 7526 process id: 38, oracle@memory-presentation.local (TNS V1-V3) session id: 27 session serial #: 32599 } is waiting for 'enq: TM - contention' with wait info: { p1: 'name|mode'=0x544d0003 p2: 'object #'=0x57e6 p3: 'table/partition'=0x0 time in wait: 1 min 35 sec timeout after: never wait id: 74 blocking: 0 sessions wait history: * time between current wait and wait #1: 0.000123 sec 1. event: 'SQL*Net message from client' time waited: 26.493425 sec wait id: 73 p1: 'driver id'=0x62657100 p2: '#bytes'=0x1 * time between wait #1 and #2: 0.000025 sec 2. event: 'SQL*Net message to client' time waited: 0.000002 sec wait id: 72 p1: 'driver id'=0x62657100 p2: '#bytes'=0x1 * time between wait #2 and #3: 0.000005 sec 3. event: 'SQL*Net break/reset to client' time waited: 0.000055 sec wait id: 71 p1: 'driver id'=0x62657100 p2: 'break?'=0x0 } and is blocked by => Oracle session identified by: { instance: 1 (test.test) os id: 7481 process id: 24, oracle@memory-presentation.local (TNS V1-V3) session id: 55 session serial #: 9654 } which is waiting for 'SQL*Net message from client' with wait info: { p1: 'driver id'=0x62657100 p2: '#bytes'=0x1 time in wait: 1 min 39 sec timeout after: never wait id: 6714 blocking: 1 session wait history: * time between current wait and wait #1: 0.000006 sec 1. event: 'SQL*Net message to client' time waited: 0.000004 sec wait id: 6713 p1: 'driver id'=0x62657100 p2: '#bytes'=0x1 * time between wait #1 and #2: 0.000203 sec 2. event: 'SQL*Net message from client' time waited: 17.115734 sec wait id: 6712 p1: 'driver id'=0x62657100 p2: '#bytes'=0x1 * time between wait #2 and #3: 0.000007 sec 3. event: 'SQL*Net message to client' time waited: 0.000004 sec wait id: 6711 p1: 'driver id'=0x62657100 p2: '#bytes'=0x1 } Chain 1 Signature: 'SQL*Net message from client'<='enq: TM - contention' Chain 1 Signature Hash: 0x163c4cba ------------------------------------------------------------------------------- ===============================================================================
If you wait a little longer, there is another type of dia0 trace file that appears: ‘..vfy_1.trc’, which is a file containing ‘verified hangs’. They appear one time for a given hang situation, and looks like this:
One or more possible hangs have been detected on your system. These could be genuine hangs in which no further progress will be made without intervention, or it may be very slow progress in the system due to high load. Previously detected and output hangs are not displayed again. Instead, the 'Victim Information' section will indicate that the victim is from an 'Existing Hang' under the 'Existing or' column. 'Verified Hangs' below indicate one or more hangs that were found and identify the final blocking session and instance on which they occurred. Since the current hang resolution state is 'OFF', there will be no attempt to resolve any of these hangs unless hang resolution is enabled. Terminating this session may well free up the hang. Be aware that if the column named 'Fatal BG' is marked with 'Y', terminating the process will cause the instance to crash. Any hangs with a 'Hang Resolution Action' of 'Unresolvable' will be ignored. These types of hangs will either be resolved by another layer in the RDBMS or cannot be resolved as they may require external intervention. Deadlocks (also named cycles) are currently NOT resolved even if hang resolution is enabled. The 'Hang Type' of DLCK in the 'Verified Hangs' output identifies these hangs. Below that are the complete hang chains from the time the hang was detected. The following information will assist Oracle Support Services in further analysis of the root cause of the hang. *** 2017-12-11T07:05:07.989085+00:00 HM: Hang Statistics - only statistics with non-zero values are listed number of hangs detected 1 number of local hangs 1 hangs ignored due to application contention 1 hangs:total number of impacted sessions 2 hangs:minimum number of impacted sessions 2 hangs:maximum number of impacted sessions 2 current number of hangs 1 current number of active sessions 1 current number of hung sessions 2 *** 2017-12-11T07:05:07.989122+00:00 Verified Hangs in the System Root Chain Total Hang Hang Hang Inst Root #hung #hung Hang Hang Resolution ID Type Status Num Sess Sess Sess Conf Span Action ----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- 1 HANG VICSELTD 1 58 2 2 LOW LOCAL IGNRD:HngRslnScpOFF Hang Ignored Reason: Hang resolution is disabled. Inst Sess Ser Proc Wait Wait Num ID Num OSPID Name Time(s) Event ----- ------ ----- --------- ----- ------- ----- 1 45 5781 27709 FG 137 enq: TM - contention 1 58 15676 27700 FG 146 SQL*Net message from client Wait event statistics for event 'SQL*Net message from client' Self-Resolved Hang Count : 0 Self-Resolved Hang Total Hang Time in seconds : 0 Self-Resolved Hang Minimum Hang Time in seconds : 0 Self-Resolved Hang Maximum Hang Time in seconds : 0 Resolved Hang Count : 0 Total Hung Session Count : 1 Total Wait Time in micro-seconds : 59799468 Total Wait Count : 67 Moving Average of Wait Time in micro-seconds : 5220981 Victim Information Existing or Ignored HangID Inst# Sessid Ser Num OSPID Fatal BG New Hang Count ------ ----- ------ ------- --------- -------- ------------- ------- 1 1 58 15676 27700 N New Hang 1
It appears Oracle adds more and more useful information in trace files, which only can be fetched from the trace files. Yet it doesn’t seem that any monitoring is actually picking this up, which, to my knowledge includes Oracle’s own Enterprise Manager.
The hang manager tracing talks about outliers, upon grepping through the dia0 trace files I found an explanation what dia0 is considering an outlier:
*** 2017-11-26T16:45:33.193479+00:00 HM: Hang Statistics - only statistics with non-zero values are listed current number of active sessions 1 instance health (in terms of hung sessions) 100.00% Dumping Wait Events with non-zero hung session counts or non-zero outliers wait counts (NOTE: IO outliers have wait times greater than 32 ms. non-IO outliers have wait times greater than 64 s. IDLE wait events are not output.) IO Total Self- Total Total Outlr Outlr Outlr Hung Rslvd Rslvd Wait WaitTm Wait WaitTm Wait Sess Hangs Hangs Count Secs Count Secs Count Wait Event ------ ------ ------ ------ ------ ------ ------ ------ ----------- 0 0 0 1366 2 0 0 1 log file parallel write 1 0 0 2206 4294966594 1 4294966272 0 resmgr:cpu quantum 0 0 0 1 1 0 0 1 external table read 1 0 0 0 0 0 0 0 not in wait *** 2017-11-26T16:45:33.193686+00:00 TotalCPUTm:4230 ms TotalRunTm:16027340 ms %CPU: 0.03%
Full name
Frits Hoogland
My company
http://www.enkitec.com
Recent comments
12 weeks 4 days ago
24 weeks 5 days ago
29 weeks 1 day ago
29 weeks 6 days ago
34 weeks 3 days ago
1 year 3 weeks ago
1 year 23 weeks ago
2 years 1 week ago
2 years 37 weeks ago
2 years 38 weeks ago