
Row Values to Comma Separated Lists, an Overly Complicated Use Case Example

May 26, 2011

In a previous article I showed a simple method to convert values found in multiple table rows into a comma-delimited list. The method works very well, until the situation in which the approach is to be used becomes a bit more complex. Assume for a moment that the following table structure [...]

How to use the RAC FAN Java API

While writing the workload management chapter for the RAC book, I researched an interesting new feature available with Oracle 11g Release 2: the so-called RAC FAN API. The RAC FAN API is documented in the Oracle Database JDBC Developer’s Guide, 11g Release 2 (11.2), which is available online. The initial documentation following the 11.2.0.1 release on Linux was pretty useless, but the good news is that it has improved!

The RAC FAN Java API

The aim of this API is to allow a Java application to listen to FAN events by creating a subscription to the RAC nodes’ ONS processes. The application then registers a FanEventListener, based on the subscription, which can pick up instances of the following events:

  • ServiceDownEvent
  • NodeDownEvent
  • LoadAdvisoryEvent

All of these are in the oracle.simplefan package; you can find the javadoc reference in the official documentation.

When it was initially released with 11.2.0.1 I tried to get the FAN subscription to work, without any luck: the registration with the ONS didn’t work. Schematically, this is how it should work:

The application (shown at the top of the diagram) requires an Oracle client for the ONS configuration and libraries, and the setup is very similar to how you’d set up FCF (Fast Connection Failover). With the ons.config file pointing to the RAC nodes’ ONS processes, you should be able to read FAN events as they occur on the source system. The file, $CLIENT_HOME/opmn/conf/ons.config, has to contain the local and remote ONS ports, as well as all the RAC nodes with their remote ONS port. When subscribing to the events, you specify the service name you’d like to receive events for.

My four-node RAC cluster comprises the nodes rac11gr2drnode{1,2,3,4}, with the local ONS listening on port 6200 for external and port 6100 for internal requests. This can be checked using the onsctl debug command on any cluster node. The client’s ons.config contains these lines:

localport:6100
remoteport:6200
nodes=rac11gr2drnode1:6200,rac11gr2drnode2:6200,rac11gr2drnode3:6200,rac11gr2drnode4:6200

Unfortunately I couldn’t subscribe to the remote ONS. I know that my setup is valid because of the UCP/FCF testing I did previously (check my earlier blog postings about how to get started with UCP and Tomcat 6). Or maybe it doesn’t work on Windows, who knows? When trying to subscribe to the remote ONS, my application bails out with the following error:

Exception in thread "main" oracle.ons.SubscriptionException: Subscription request timed out after 30000 millseconds. Possible causes: OPMN may not be running, you may have an OPMN running in an alternate ORACLE_HOME using duplicate port values, or OPMN may be misconfigured.
at oracle.ons.SubscriptionNotification.waitForReply(SubscriptionNotification.java:83)
at oracle.ons.ONS.addSubscriber(ONS.java:956)
at oracle.ons.Subscriber.realStartup(Subscriber.java:103)
at oracle.ons.Subscriber.<init>(Subscriber.java:79)
at oracle.simplefan.impl.FanSubscription.subscribe(FanSubscription.java:228)
at oracle.simplefan.impl.FanSubscription.createFanSubscription(FanSubscription.java:46)
at oracle.simplefan.impl.FanManager.subscribe(FanManager.java:120)
at de.mbh.TestClass.<init>(TestClass.java:21)
at de.mbh.TestClass.main(TestClass.java:53)

I tried my favourite search engine on the Exception but couldn’t find any useful information. This is not a big deal for me: I can start a local ONS on the client, as shown here:

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl ping
ons is not running ...

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl start
onsctl start: ons started

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl debug
HTTP/1.1 200 OK
Content-Length: 3627
Content-Type: text/html
Response:

== client:6200 5872 11/05/26 11:47:16 ==
Home: c:\oracle\product\11.2.0\client_1

======== ONS ========
IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
10.xxx.xx.98  6200 4dde2fb0 00000006 00000008

Listener:

TYPE                BIND ADDRESS               PORT  SOCKET
-------- --------------------------------------- ----- ------
Local                                  127.0.0.1  6100    344
Remote                                       any  6200    352

Servers: (3)

INSTANCE NAME                  TIME   SEQUENCE  FLAGS     DEFER
---------------------------------------- -------- -------- -------- ----------
dbInstance_rac11gr2drnode1_6200          4d53d6a4 00029496 00000002          0
10.xxx.xx.155 6200
dbInstance_rac11gr2drnode2_6200          4d3eb9e2 0006cbeb 00000002          0
10.xxx.xx.157 6200
dbInstance_rac11gr2drnode3_6200 4d5937d6 0001990f 00000002          0
10.xxx.xx.158 6200

Connection Topology: (4)

IP                      PORT   VERS  TIME
--------------------------------------- ----- ----- --------
10.xxx.xx.158  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.157 6200
**                            10.xxx.xx.98 6200
10.xxx.xx.157  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.158 6200
**                            10.xxx.xx.98 6200
10.xxx.xx.155  6200     4 4dde2fb2
**                            10.xxx.xx.98 6200
**                           10.xxx.xx.157 6200
**                           10.xxx.xx.158 6200
10.xxx.xx.98  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.158 6200
**                           10.xxx.xx.157 6200

Server connections:

ID            CONNECTION ADDRESS              PORT  FLAGS  SENDQ REF WSAQ
-------- --------------------------------------- ----- ------ ----- --- ----
0                           10.xxx.xx.155  6200 010405 00000 001  ---
1                           10.xxx.xx.156  6200 002405 00000 001
2                           10.xxx.xx.157  6200 010405 00000 001  --
3                           10.xxx.xx.158  6200 010405 00000 001  ---

Client connections:

ID            CONNECTION ADDRESS              PORT  FLAGS  SENDQ REF SUB W
-------- --------------------------------------- ----- ------ ----- --- --- -
4                                internal     0 01008a 00000 001 002
request                               127.0.0.1  6100 03201a 00000 001 000

Worker Ticket: 28/28, Last: 11/05/26 11:47:15

THREAD   FLAGS
-------- --------
120 00000012
124 00000012
128 00000012

Resources:

Notifications:
Received: Total 12 (Internal 6), in Receive Q: 0
Processed: Total 12, in Process Q: 0

Pool Counts:
Message: 1, Link: 1, Ack: 1, Match: 1

C:\oracle\product\11.2.0\client_1\opmn\bin>

With a local ONS started on my client, I can actually subscribe to the ONS and make use of the events. The easiest way is to simply decode the load balancing events, as I did in my code, shown below (a modified version of the code in the Oracle documentation, changed to make it work):

package de.mbh;

import oracle.simplefan.FanSubscription;
import oracle.simplefan.FanEventListener;
import oracle.simplefan.FanManager;
import oracle.simplefan.LoadAdvisoryEvent;
import oracle.simplefan.NodeDownEvent;
import oracle.simplefan.ServiceDownEvent;
import java.util.Properties;

public class TestClass {

    TestClass() {
        System.out.println("Hello");

        Properties p = new Properties();
        p.put("serviceName", "OEMSRV");

        System.setProperty("oracle.ons.oraclehome", "c:\\oracle\\product\\11.2.0\\client_1");
        System.out.println(System.getProperty("oracle.ons.oraclehome"));

        FanSubscription sub = FanManager.getInstance().subscribe(p);

        System.out.println("I'm subscribed!");

        sub.addListener(new FanEventListener() {

            public void handleEvent(ServiceDownEvent arg0) {
                System.out.println("Service Down registered!");
            }

            public void handleEvent(NodeDownEvent arg0) {
                System.out.println("Node Down Event Registered");
            }

            public void handleEvent(LoadAdvisoryEvent arg0) {
                System.out.println("Just got a Load Advisory event");

                System.out.println("originating database: " + arg0.getDatabaseUniqueName());
                System.out.println("originating instance: " + arg0.getInstanceName());
                System.out.println("Service Quality     : " + arg0.getServiceQuality());
                System.out.println("Percent             : " + arg0.getPercent());
                System.out.println("Service Name        : " + arg0.getServiceName());
                System.out.println("Service Quality     : " + arg0.getServiceQuality());
                System.out.println("Observed at         : " + arg0.getTimestamp() + "\n\n");
            }
        });
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        TestClass tc = new TestClass();

        int i = 0;
        while (i < 100000) {
            try {
                Thread.sleep(100);
                i++;
            } catch (Exception e) {
                System.out.println(e);
            }
        }

        System.out.println("execution ended");
    }

}

When compiling or executing the code, you need the simplefan.jar and ons.jar files in your classpath, and chances are that you need ojdbc6.jar as well.
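For reference, a minimal sketch of compiling and running it on the Windows client used here, run from the directory that contains the de\mbh source folder (the jar paths are placeholders; locate simplefan.jar, ons.jar and ojdbc6.jar under your own client installation):

rem placeholder paths - point these at the jars shipped with your Oracle client
set CP=.;C:\path\to\simplefan.jar;C:\path\to\ons.jar;C:\path\to\ojdbc6.jar

javac -cp %CP% de\mbh\TestClass.java
java  -cp %CP% de.mbh.TestClass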

Starting the application reveals Load Balancing events being read:

Hello
c:\oracle\product\11.2.0\client_1
I'm subscribed!
Just got a Load Advisory event
originating database: LNGDS1RD
originating instance: dbInstance_rac11gr2drnode3_6200
Service Quality     : -1
Percent             : 51
Service Name        : OEMSRV
Service Quality     : -1
Observed at         : Thu May 26 11:56:03 BST 2011

Just got a Load Advisory event
originating database: LNGDS1RD
originating instance: dbInstance_rac11gr2drnode3_6200
Service Quality     : -1
Percent             : 49
Service Name        : OEMSRV
Service Quality     : -1
Observed at         : Thu May 26 11:56:03 BST 2011

This proves that you can write your own connection cache, which allows you to react to the *down events and rebalance your sessions accordingly.

Interval Partitions

Here’s a little gem from the OTN database forum: why should the CPU usage of querying an “interval partitioned” table depend on the size of something that’s missing?

(The most significant symptom is identified in a reply I wrote to the thread – but that’s only a step in the right direction).

Fedora 15: First Impressions…

It’s been nearly six months since I made the switch from CentOS to Fedora as my main desktop OS.

The Fedora 15 final release dropped a couple of days ago and I slapped it on my main desktop PC straight away. As usual, the first impression is all about the visuals. GNOME 3 looks great. I’m sure lots of people think KDE looks great too, but I tend to just stick with the default window manager, so it’s GNOME for me.

If you’ve read any of the press you will know that the menu bar and task bar have been removed. This is kinda weird at first. How does it affect me?

  • Task Bar: I was surprised how reliant I was on the task bar for switching between open apps. I would say about 50% of the time I was using the task bar and 50% of the time I was using alt+tab. Now I’m using alt+tab 100% of the time. I think this change has worked really well for me. I feel like I’m navigating quicker.
  • Menu Bar: I think this one will annoy a few people. In essence, the menu bar is still there, but one layer down. If you click on “Activities” in the top left (or hit the Windows key) you switch to the activities screen. There is a favorites dock on the left, and if you click “Applications” you get a menu (more like a filter) on the right of the screen. That’s all nice, but the thing I really like is that if you press the Windows key and type a few letters, it returns all apps and items that contain those letters in the name. Similar to the Windows menu, but prettier. So if you insist on using the menu like an old-style menu, I think this change will annoy you, as it requires an extra click and some animations. If you use the favorites dock and the search feature, I think it’s quite cool.

The fancy visuals worked straight out of the box for my main desktop machine, but one of my other machines (with a better graphics card) couldn’t handle GNOME 3 and ran using fallback mode. Fallback mode is pretty much like previous GNOME releases with a menu bar and task bar. I’m sure some people will prefer fallback mode, but I think the new stuff is certainly worth a try.

If you really can’t handle the new interface you can manually switch to fallback mode. Start up the System Info dialog (Activities > Applications > System Settings > System Info), click “Graphics”, flick the “Forced Fallback Mode” switch and relog.

As for the OS itself, I’ve had no dramas so far, but it is early days. Time will tell…

By the way, I did the usual Oracle on Fedora thing.

Cheers

Tim…




Parsing 10046 traces

UPDATE July 9, 2011: README.TXT    10046.pl

 

Often when I have a 10046 trace file, especially when looking at I/O issues, I want a histogram of I/O response time. To get I/O response times I’ve hacked out incomplete awk scripts from time to time, always meaning to write a more complete one. Well, now I don’t have to: it’s already been done!

Here is a cool perl script from Clive Bostock: 10046.pl

(also check out orasrp, which produces a more in-depth report in HTML. I like both: 10046.pl is a short, easy, portable script that I can modify, whereas orasrp is a binary and only works on some platforms)
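If you still need to produce such a trace file, a minimal sketch for tracing your own session with the classic event syntax is shown below (DBMS_MONITOR is the supported alternative, and the trace file location can be found via user_dump_dest or v$diag_info, depending on version):

-- tag the trace file so it is easy to find
alter session set tracefile_identifier = 'io_test';

-- level 8 includes wait events, the interesting part for I/O analysis
alter session set events '10046 trace name context forever, level 8';

-- ... run the workload you want to analyse ...

alter session set events '10046 trace name context off';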

For example, if I trace a session with 10046 and retrieve the trace file, I can run:

$ 10046.pl -t  mytrace.trc

and it will output a header and three sections:

  • Header
  • Summary of all events for tracefile
  • Events by object summary
  • Events by object histogram

This looks like:

Header

Trace file mytrace.trc
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u02/oracle
System name:    SunOS
Node name:      toto
Release:        5.10
Version:        Generic_142900-12
Machine:        sun4u
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 177
Unix process pid: 16553, image: oracle@toto
Trace input file : mytrace.trc

Wait summary

EVENT AGGREGATES
================
Wait Event              Count Elapsed(ms)   Avg Ela (ms)  %Total
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~
db file sequential read  2715      11593              4    3.74
       direct path read  4484       4506              1    1.45
 db file scattered read   141        898              6    0.29
          log file sync     3          8              2    0.00
                              ~~~~~~~~~~~~
             Total Elapsed:       309821

Wait Summary by object

Object Id  : Wait Event                  Count Tot Ela (ms) %Total Avg Ela (ms)
~~~~~~~~~~ : ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~ ~~~~~ ~~~~~~~~~~~~~
28581      : direct path read            4484         4506   1.45            1
1756433    : db file sequential read      725         1891   0.61            2
764699     : db file sequential read      332         1762   0.57            5
37840      : db file sequential read      200         1044   0.34            5
38018      : db file sequential read      108         1009   0.33            9
81596      : db file scattered read       140          887   0.29            6

Wait histogram by object

EVENT HISTOGRAM BREAKDOWN
===========================
This section splits the event counts into elapsed time
buckets so that we can see if there are any suspiciousn
or anomalous response time / frequency patterns.
Object Id : Wait Event              <1ms <2ms <4ms <8ms <16ms <32ms <64ms <128ms <256ms <512ms >=1024ms
~~~~~~~~~ : ~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~~ ~~~~~ ~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~~~
28581     : direct path read        7680   87  148  221  144    40     4     0      0       0        0
1756433   : db file sequential read  606  268   45   35   66     6     2     0      0       0        0
764699    : db file sequential read   74  119   11   78   78     9     0     0      0       0        0
37840     : db file sequential read   50   72    6   45   47     5     0     0      0       0        0
38018     : db file sequential read   12   38    7   10   30    12     5     0      0       0        0
81596     : db file scattered read    64    4   13   62   18     8     3     0      0       0        0
41995     : db file sequential read   20   39    0    7   16     8     4     0      0       0        0
108718    : db file sequential read   74   54    5   12   24     4     0     0      0       0        0
33490     : db file sequential read    0    5   11   25   19     4     0     0      0       0        0

WordPress 3.1.3 Released…

WordPress 3.1.3 has been released. Happy upgrading…

Cheers

Tim…




audit

Here’s one of those funny little details that can cause confusion:

SQL> select * from user_audit_object;

no rows selected

SQL> audit select on indjoin by session whenever successful;

Audit succeeded.

SQL> select
  2     count(*)
  3  from
  4     indjoin         ij
  5  where
  6     id between 100 and 200
  7  and        val between 50 and 150
  8  ;

  COUNT(*)
----------
        51

1 row selected.

SQL> select * from user_audit_object where obj_name = 'INDJOIN';

no rows selected

So we had nothing in our “object audit trail”; we then enabled auditing on one particular table for select statements (and audit_trail = db has been set in the parameter file), but our select statement hasn’t been audited. What went wrong?
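(If you want to double-check that prerequisite, the setting is easy to confirm from SQL*Plus; remember that audit_trail is a static parameter, so changing it requires an instance restart.)

SQL> show parameter audit_trail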

Let’s try a different query against the audit view which, until a moment ago, held no data:


SQL> select obj_name from user_audit_object;

OBJ_NAME
-----------------------------------------------
index$_join$_001

So we have audited something, but what? Here’s the execution plan for our query:

----------------------------------------------------------------------------------------
| Id  | Operation           | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                  |     1 |     8 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE     |                  |     1 |     8 |            |          |
|*  2 |   VIEW              | index$_join$_001 |     3 |    24 |     5  (20)| 00:00:01 |
|*  3 |    HASH JOIN        |                  |       |       |            |          |
|*  4 |     INDEX RANGE SCAN| IJ_PK            |     3 |    24 |     3  (34)| 00:00:01 |
|*  5 |     INDEX RANGE SCAN| IJ_VAL           |     3 |    24 |     3  (34)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("VAL"<=150 AND "ID"<=200 AND "ID">=100 AND "VAL">=50)
   3 - access(ROWID=ROWID)
   4 - access("ID">=100 AND "ID"<=200)
   5 - access("VAL">=50 AND "VAL"<=150)

See that index$_join$_001 in line 2? We have a plan that uses the “index hash join” mechanism to query that table, so Oracle audits the query but unfortunately records the name of the internal view in the audit record.

Be very careful how you audit objects; you may think that an object has not been accessed when it has (sort of). If you’ve started doing this type of auditing to check whether or not an object is ever accessed, you could be fooled.
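If you do rely on this kind of object auditing, one defensive option is to keep an eye out for the internal index-join names as well. A minimal sketch (pick whatever columns you need):

select obj_name, action_name, timestamp
from   user_audit_object
where  obj_name like 'index$%'
order by timestamp;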

You could claim, of course, that the object hasn’t really been accessed – but compare the index join above with the following (cut-n-paste) example, which I ran after deleting everything from the aud$ table:

SQL> connect test_user/test
Connected.

SQL> select obj_name from user_audit_object;

no rows selected

SQL> set autotrace on explain
SQL> select count(*) from indjoin where id = 23;

  COUNT(*)
----------
         1

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 689603510

----------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |     1 |     4 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |       |     1 |     4 |            |          |
|*  2 |   INDEX UNIQUE SCAN| IJ_PK |     1 |     4 |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ID"=23)

SQL> set autotrace off
SQL> select obj_name from user_audit_object;

OBJ_NAME
----------------
INDJOIN

1 row selected.

We haven’t actually visited the indjoin table, but the audit code recognises that that was the intent, even though the optimizer made physical access unnecessary. The behaviour ought to be the same for both cases.

(This example originally came from a client site, by the way, with the question “how come the audit trail says we’re not visiting this table?”. The client version, and the version I tested on, was 10.2.0.3. There were a couple of other little surprises I found at the same time; I may find time to write about them some other day.)

TCP monitoring strangeness continued

On http://dboptimizer.com/2011/04/27/envisioning-nfs-performance/ I showed some code that displayed the send and receive sizes and times over TCP on Solaris 10 with the dtrace command (see tcp.d for the code). I ran this code on another machine and got errors like:

"dtrace: error on enabled probe ID 29 (ID 5461: tcp:ip:tcp_input_data:receive): invalid alignment (0xffffff516eb5e83a) in predicate at DIF offset 136"

Not quite sure why this was happening, but by a process of elimination I found that accessing args[4] in tcp:::receive caused these errors.

Fortunately many of the values in args[4] are found in args[3] as well.

To find the arguments to tcp:::receive, first run the following command (or alternatively look the TCP probes up on the wiki at http://wikis.sun.com/display/DTrace/tcp+Provider).

(For the command line flags, check out http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?dtrace+7:

  • -l = list probes instead of enabling them
  • -n = specify the probe name to trace or list
  • -v = set verbose mode

i.e. list verbose information about the named probes without enabling them.)

$ dtrace -lvn tcp:::receive
 5473        tcp                ip                        tcp_output send

        Argument Types
                args[0]: pktinfo_t *
                args[1]: csinfo_t *
                args[2]: ipinfo_t *
                args[3]: tcpsinfo_t *
                args[4]: tcpinfo_t *

(By the way, there are a number of probes that match tcp:::receive, but they all have the same arguments. I didn’t notice all these different tcp:::receive probes until I actually ran the above command; before running it I’d depended on the wiki. Now I’m wondering what the difference is between some of these tcp:::receive and tcp:::send probes.)

After finding the args for a probe, you can look up the definition of the structs at  http://cvs.opensolaris.org/source/

tcp:::send and tcp:::receive both have the same arguments

  • args[3] is tcpsinfo_t
  • args[4] is tcpinfo_t

Looking up the structs at http://cvs.opensolaris.org/source/, I find the contents of the structs as follows:

tcpsinfo_t (args[3])

typedef struct tcpsinfo {
        uintptr_t tcps_addr;
        int tcps_local;                 /* is delivered locally, boolean */
        int tcps_active;                /* active open (from here), boolean */
        uint16_t tcps_lport;            /* local port */
        uint16_t tcps_rport;            /* remote port */
        string tcps_laddr;              /* local address, as a string */
        string tcps_raddr;              /* remote address, as a string */
        int32_t tcps_state;             /* TCP state */
        uint32_t tcps_iss;              /* Initial sequence # sent */
        uint32_t tcps_suna;             /* sequence # sent but unacked */
        uint32_t tcps_snxt;             /* next sequence # to send */
        uint32_t tcps_rack;             /* sequence # we have acked */
        uint32_t tcps_rnxt;             /* next sequence # expected */
        uint32_t tcps_swnd;             /* send window size */
        int32_t tcps_snd_ws;            /* send window scaling */
        uint32_t tcps_rwnd;             /* receive window size */
        int32_t tcps_rcv_ws;            /* receive window scaling */
        uint32_t tcps_cwnd;             /* congestion window */
        uint32_t tcps_cwnd_ssthresh;    /* threshold for congestion avoidance */
        uint32_t tcps_sack_fack;        /* SACK sequence # we have acked */
        uint32_t tcps_sack_snxt;        /* next SACK seq # for retransmission */
        uint32_t tcps_rto;              /* round-trip timeout, msec */
        uint32_t tcps_mss;              /* max segment size */
        int tcps_retransmit;            /* retransmit send event, boolean */

tcpinfo_t (args[4])

typedef struct tcpinfo {
        uint16_t tcp_sport;             /* source port */
        uint16_t tcp_dport;             /* destination port */
        uint32_t tcp_seq;               /* sequence number */
        uint32_t tcp_ack;               /* acknowledgment number */
        uint8_t tcp_offset;             /* data offset, in bytes */
        uint8_t tcp_flags;              /* flags */
        uint16_t tcp_window;            /* window size */
        uint16_t tcp_checksum;          /* checksum */
        uint16_t tcp_urgent;            /* urgent data pointer */
        tcph_t *tcp_hdr;                /* raw TCP header */
} tcpinfo_t;

In the script I accessed the TCP seq and ack in args[4], which was giving me errors in tcp:::receive, so I just switched these for the equivalents in args[3]. Now I’m not exactly clear on the equivalence, but it seems:

  • args[4]->tcp_seq  ?= args[3]->tcps_suna
  • args[4]->tcp_ack  ?= args[3]->tcps_rack

The ack=rack seems solid as tcps_rack = “highest sequence number for which we have received and sent an acknowledgement”.

The seq=tcps_suna is less clear to me, as tcps_suna = “lowest sequence number for which we have sent data but not received acknowledgement”.

But for my purposes these distinctions might be unimportant, as I’ve stopped looking at seq and ack and now look at the outstanding unacknowledged bytes on the receiver and sender (see the short sketch after this list), which are:

  • tcps_snxt - tcps_suna gives the number of bytes sent but pending acknowledgement
  • tcps_rnxt - tcps_rack gives the number of bytes we have received but not yet acknowledged
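
As a rough sketch of that idea (an illustration built from the tcpsinfo_t fields above, not the script that follows):

tcp:::send
{
        /* bytes we have sent that the peer has not yet acknowledged */
        printf("%s:%d unacked send bytes: %d\n",
            args[3]->tcps_raddr, args[3]->tcps_rport,
            args[3]->tcps_snxt - args[3]->tcps_suna);
}

tcp:::receive
{
        /* bytes we have received but not yet acknowledged */
        printf("%s:%d unacked recv bytes: %d\n",
            args[3]->tcps_raddr, args[3]->tcps_rport,
            args[3]->tcps_rnxt - args[3]->tcps_rack);
}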

More about that later. For now, let’s look at the new version of the program and its output:

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/   title == 0 /
{   printf("%9s %8s %6s %8s %8s %12s %s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "seq",
        "ack"
      );
     title=TITLE;
}
tcp:::send
/     ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%9d %8d %6d %8d --> %8s %8d %8d %12s > %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[4]->tcp_seq  -
        args[3]->tcps_suna ,
        args[4]->tcp_ack -
        args[3]->tcps_rack ,
        args[3]->tcps_raddr,
        curpsinfo->pr_psargs
      );
    title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{     delta=timestamp-walltime;
      walltime=timestamp;
      printf("%9d %8d %6d %8s <-- %-8d %8d %8d %12s < %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_rack % 10000,
        args[3]->tcps_suna % 10000,
        args[3]->tcps_raddr,
        curpsinfo->pr_psargs
      );
    title--;
}

output

starting up ...
    delta      cid    pid     send     recd
      570     3904    845          <-- 0
       34     3904    845          <-- 140
      455     3904    845          <-- 0
       24     3904    845          <-- 0
  4789720     3904    845      124 -->
       82     3904    845      244 -->
       99     3904    845      132 -->
    delta      cid    pid     send     recd
       52     3904    845     1448 -->
       28     3904    845     1448 -->
       24     3904    845     1448 -->
       36     3904    845     1448 -->
       33     3904    845     1448 -->
       26     3904    845      952 -->
       86     3904    845      244 -->
      212     3904    845          <-- 140
      501     3904    845          <-- 132
       60     3904    845      124 -->
      256     3904    845          <-- 140
       72     3904    845          <-- 0
    39658     3904    845          <-- 0

What the heck is that huge time of 4789720 us, i.e. roughly 4.8 seconds? The whole operation took me less than 1 second.
I wouldn’t have found the answer to this without help from Adam Leventhal. It turns out that the output is in order only per CPU, but different CPUs can output data in a different order. Each CPU buffers up its data and then passes the buffer back to userland dtrace, so on a one-CPU machine this code will always output in order, but on a multi-CPU machine there is no guarantee on the order of the output. Let’s add the CPU # to the output:

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/   title == 0 /
{   printf("%9s %8s %6s %8s %8s %4s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "cpu#"
      );
     title=TITLE;
}
tcp:::send
/     ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%9d %8d %6d %8d --> %8s %d \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        cpu
      );
    title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{     delta=timestamp-walltime;
      walltime=timestamp;
      printf("%9d %8d %6d %8s <-- %-8d %d \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        cpu
      );
    title--;
}

output

  delta      cid    pid     send     recd cpu#
       42     3904    845      244 -->          0
       66     3904    845      124 -->          0
     6091     3904    845      124 -->          2
       81     3904    845      244 -->          2
       96     3904    845      132 -->          2
       31     3904    845     1448 -->          2
       20     3904    845     1448 -->          2
       18     3904    845     1448 -->          2
       17     3904    845     1448 -->          2
       16     3904    845     1448 -->          2
  8910406     3904    845        0 -->          3
      375     3904    845          <-- 0        3
       24     3904    845          <-- 140      3
       34     3904    845        0 -->          3
      470     3904    845          <-- 140      3
      410     3904    845          <-- 132      3
    delta      cid    pid     send     recd cpu#
      491     3904    845          <-- 140      3
      393     3904    845          <-- 0        3
       15     3904    845      952 -->          3
       36     3904    845          <-- 0        3
    delta      cid    pid     send     recd cpu#
       19     3904    845          <-- 0        3
    40167     3904    845          <-- 0        3

What we see is the data ordered by CPU. In other words, for each CPU the data is ordered, but which CPU gets printed first is unknown. In dtrace, each CPU buffers up its data and then sends it to the userland dtrace process.
The only “fix” for now is to add a timestamp and order the data by timestamp.
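In the script, that just means emitting the raw timestamp as the first column; for the send probe, for example, the printf becomes something along these lines (a sketch of the change, not the full script):

    printf("%d %9d %8d %6d %8d --> %8s %d \n",
        timestamp,
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        cpu
      );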
Unsorted, it looks like:

607858102997348       281     3904    845      124 -->          2
607858103608856        84     3904    845      244 -->          2
607858104125414     delta      cid    pid     send     recd cpu#
607858104143731       116     3904    845      132 -->          2
607858104176769        33     3904    845     1448 -->          2
607858104198187        21     3904    845     1448 -->          2
607858104215813        17     3904    845     1448 -->          2
607858104233004        17     3904    845     1448 -->          2
607858104267629        34     3904    845     1448 -->          2
607858104287379        19     3904    845      952 -->          2
607858102716187  11569935     3904    845          <-- 132      3
607858103245377       248     3904    845          <-- 0        3
607858103282421        37     3904    845          <-- 140      3
607858103339076        56     3904    845      244 -->          3
607858103524093       185     3904    845          <-- 140      3
607858103774417       165     3904    845          <-- 132      3
607858103823145        48     3904    845      124 -->          3
607858104027216       204     3904    845          <-- 140      3
607858104387780       100     3904    845          <-- 0        3
607858104401487        13     3904    845          <-- 0        3
607858104520815       119     3904    845          <-- 0        3
607858144436175     delta      cid    pid     send     recd cpu#
607858144454625     39933     3904    845          <-- 0        3

Sorted, it looks like:

607858102716187  11569935     3904    845          <-- 132      3
607858102997348       281     3904    845      124 -->          2
607858103245377       248     3904    845          <-- 0        3
607858103282421        37     3904    845          <-- 140      3
607858103339076        56     3904    845      244 -->          3
607858103524093       185     3904    845          <-- 140      3
607858103608856        84     3904    845      244 -->          2
607858103774417       165     3904    845          <-- 132      3
607858103823145        48     3904    845      124 -->          3
607858104027216       204     3904    845          <-- 140      3
607858104125414     delta      cid    pid     send     recd cpu#
607858104143731       116     3904    845      132 -->          2
607858104176769        33     3904    845     1448 -->          2
607858104198187        21     3904    845     1448 -->          2
607858104215813        17     3904    845     1448 -->          2
607858104233004        17     3904    845     1448 -->          2
607858104267629        34     3904    845     1448 -->          2
607858104287379        19     3904    845      952 -->          2
607858104387780       100     3904    845          <-- 0        3
607858104401487        13     3904    845          <-- 0        3
607858104520815       119     3904    845          <-- 0        3
607858144436175     delta      cid    pid     send     recd cpu#
607858144454625     39933     3904    845          <-- 0        3

So now the strange long time delta is at the beginning, where I’d expect it.

I’m not quite sure how to deal with this. Post-processing the data by sorting on the timestamp column works, but interactively processing the data to get it in the right order as it streams out seems problematic.
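For the post-processing route, capturing the output to a file and sorting numerically on the first (timestamp) column is enough; a minimal sketch (the script and file names here are placeholders):

# capture the raw dtrace output, then order events by the timestamp column
./tcp_cpu.d > tcp.out
sort -n -k 1,1 tcp.out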

Philosophy – Measurement


In my professional opinion, if your performance work is to have any credibility then you need to agree with and abide by this philosophy. I think it’s really what separates the real experts from the amateurs.

“Measurement is the first step that leads to control and eventually to improvement. If you can’t measure something, you can’t understand it. If you can’t understand it, you can’t control it. If you can’t control it, you can’t improve it.” – Dr. H. James Harrington

Discuss.


Finally on Twitter

Overheard in an IRC chat room (Freenode#oracle) this morning…