Search

OakieTags

Who's online

There are currently 0 users and 39 guests online.

Recent comments

Affiliations

Oakies Blog Aggregator

Metadata bug

Here’s a funny little bug – which probably won’t cause any damage – that may remind you that (most of) the people who work for Oracle are just ordinary people like you and me who make ordinary little mistakes in their programming. It’s a bug I discovered by accident because I just wanted to check something about how a particular undo tablespace had been defined, and I called dbms_metadata instead of querying dba_tablespaces. Here’s the cut-n-paste from an SQL*Plus session on 11.2.0.2:


SQL> select dbms_metadata.get_ddl('TABLESPACE','UNDOTBS1') text from dual;

TEXT
----------------------------------------------------------------------------------------------------

  CREATE UNDO TABLESPACE "UNDOTBS1" DATAFILE
  SIZE 209715200
  AUTOEXTEND ON NEXT 5242880 MAXSIZE 32767M,
  '/u01/app/oracle/oradata/TestDB11/redo03.log' SIZE 209715200
  AUTOEXTEND ON NEXT 5242880 MAXSIZE 32767M,
  '/u01/app/oracle/oradata/TestDB11/undotbs01.dbf' SIZE 209715200
  AUTOEXTEND ON NEXT 5242880 MAXSIZE 32767M
  BLOCKSIZE 8192
  EXTENT MANAGEMENT LOCAL AUTOALLOCATE
   ALTER DATABASE DATAFILE
  '/u01/app/oracle/fast_recovery_area/TESTDB11/flashback/o1_mf_8v4qrxy1_.flb' RESIZE 959447040
   ALTER DATABASE DATAFILE
  '/u01/app/oracle/oradata/TestDB11/redo03.log' RESIZE 959447040
   ALTER DATABASE DATAFILE
  '/u01/app/oracle/oradata/TestDB11/undotbs01.dbf' RESIZE 959447040

SQL> select file_name from dba_data_files where tablespace_name = 'UNDOTBS1';

FILE_NAME
----------------------------------------------------------------------------------------------------
/u01/app/oracle/oradata/TestDB11/undotbs01.dbf

SQL> select member from v$logfile;

MEMBER
----------------------------------------------------------------------------------------------------
/u01/app/oracle/oradata/TestDB11/redo01.log
/u01/app/oracle/oradata/TestDB11/redo02.log
/u01/app/oracle/oradata/TestDB11/redo03.log

As you can see, a call to dbms_metadata.get_ddl() is telling me that my undo tablespace is made up of two files (and then wants to alter three (!) of them). One of which looks suspiciously like it might be one of my redo log files, and the third file is one of the flashback logs in my fast recovery area! So I queried dba_tablespaces to find that (as expected) I had only one file in my undo tablespace, and then I queried v$logfile to check whether I really did have a log file called redo03.log and I checked the fast recovery area to see when that last file had appeared.

I don’t suppose this error in dbms_metadata will have any serious side effects – although I did wonder if there might be some ramifications for a “full transportable export” (an 11.2.0.3 option) from datapump.

A quick check on MOS revealed this to be a known bug fixed in 12.1: “Bug 10177856 : DBMS_METADATA(‘TABLESPACE’, …) INCLUDES WRONGLY TEMPFILES AND ONLINE REDO-LOGF” The specific bug (and there were three reported by my search) described the SYSTEM tablespace report also reporting the temporary tablespace(s) – so I did a quick check on my SYSTEM and SYSAUX to see what would happen, and the results made it easy to see how the problem had happened – when it was collecting filenames the code in dbms_metadata had failed to check file types properly.

My query for SYSTEM reported system.dbf along with temp01.dbf and redo01.log and an OMF-named flashback log.
My query for SYSAUX report sysaux.dbf along with temp_special.dbf and redo02.log and an OMF-named flashback log.

Here’s a query (runnable only by SYS) showing why those particular extra files appeared on my system:

SQL> select fnfno, fntyp,fnnam from x$kccfn where fnfno <= 3 order by fnfno, fntyp;

     FNFNO      FNTYP FNNAM
---------- ---------- ---------------------------------------------------------------------------
         0        200 /u01/app/oracle/oradata/TestDB11/bct.dbf

         1          3 /u01/app/oracle/oradata/TestDB11/redo01.log
                    4 /u01/app/oracle/oradata/TestDB11/system01.dbf
                    7 /u01/app/oracle/oradata/TestDB11/temp01.dbf
                   24 /u01/app/oracle/fast_recovery_area/TESTDB11/flashback/o1_mf_8v439xsc_.flb

         2          3 /u01/app/oracle/oradata/TestDB11/redo02.log
                    4 /u01/app/oracle/oradata/TestDB11/sysaux01.dbf
                    7 /u01/app/oracle/oradata/TestDB11/temp_special.dbf
                   24 /u01/app/oracle/fast_recovery_area/TESTDB11/flashback/o1_mf_8v43b0hp_.flb

         3          3 /u01/app/oracle/oradata/TestDB11/redo03.log
                    4 /u01/app/oracle/oradata/TestDB11/undotbs01.dbf
                   24 /u01/app/oracle/fast_recovery_area/TESTDB11/flashback/o1_mf_8v4qrxy1_.flb
 

The code is reporting the names of the corresponding files of types 3, 7 and 24 whenever it reports a file of type 4 (which is the data files). There are other possible file types, of course, and if I had other features enabled perhaps I would see some of them. Running dbms_metadata.get_ddl with tracing enabled showed me that it queries view ku$_tablespace_view, which includes a join to view ku$_file_view, which includes a three-part union all of x$kccfn including subqueries for files of type 4 and 7 specifically, and a subquery with no restriction on file type but a join on file number to the flashback file list x$ktfthc.

Although the bug is reported as fixed in 12.1, it seems to be fixed in 11.2.0.3 (so that deals with my worry about datapump) – it was just unlucky (or lucky, depending on your viewpoint) that I happened to run my first query on version 11.2.0.2.

DBaaS : OEM Snap Clone

Oracle OEM 12c introduces a new feature that enables the creation of Oracle database thin clones by leveraging file system snapshot technologies from either Oracle ZFS Storage Appliance or Netapp.  The OEM adds a graphic interface to the process of making database thin clones. The feature that enables database thin cloning in OEM is called Snap Clone and is part of OEM’s Cloud Control Self Service for data cloning. Snap Clone is available via the feature Database as a Service (DBaaS). Snap clone leverages the copy on write technologies available in some storage systems for database cloning.  Support is initially available for NAS storage and specifically on Oralce ZFS Storage Appliacen and NetApp Storage.

In order to use Snap Clone, one has to install the source database such that the source database data files are on a ZFS storage appliance or Netapp array and have the storage managed by agents on a LINUX machine and then one can thin clone data files on that same storage array.

Snap Clone offers role based access, so storage admin can login in and only have access to areas they are responsible for as well as limiting access to source databases, clones and resource by end users.

Setting Snap Clone

 

The prerequisites for getting start with Snap Clone are having available storage on ZFS Storage Appliance or Netapp storage array as well as having access to a master test database. A master test databse is a database that has a sanatized version of a production database such that it is either a subset and or  masked. The test master database has to be registered with OEM.  After the test master is registered with OEM, Snap Clone can be setup. To set up snap clone, come into Oracle Cloud Control 12c as “cloud administrator” role with “storage adminstator” priviledge or super administrator and register the storage. To register the storage navigate to “ setup -> provisining patching -> storage registration”.

  • Navigate to “ setup -> provisining patching -> storage registration”
  • Click “Register” tab, and choose storage, either Netapp or ZFS,
    • Supply storage information
      • Name: Storage array name registered in DNS
      • Vendor
      • Protocol: http or https
      • Storage Credentials: credentials for interacting with storage
  • Install agents on a separate LINUX machine to manage the Netapp or ZFS storage. An agent has to run on Linux host to manage the  storage. Supply the
    • Agent host
    • Host credentials
  • Pick a database to make the test master
    • Put the test master on ZFS storage or Netapp storage
    • Register the ZFS storage or Netapp storage with OEM
    • Enable Snap Clone for the  test master database
  • Set up a zone – set max CPU and Memory for a set of hosts and the roles that can see these zones
  • Set up a pool – a pool is a set of machines where databases can be provisioned
  • Set up a profile – a source database that can be used for thin cloning
  • Set up a service template – reference values such as a init.ora for database to be created

Screen Shot 2013-06-09 at 10.08.47 PM

Figure 1. Shows the entry page in OEM 12c Cloud Control. From here go to the top right and choose setup, then provisining patching then storage registration as shown above.

 Navigate to storage registration

To setup Snap Clone navigate to storage registyratiom choose the menus “setup -> provisining patching -> storage registration”.

Screen Shot 2013-06-09 at 10.10.44 PM

Figure 2 Shows a zoom into the menus to choose

Screen Shot 2013-05-31 at 10.09.02 PM

Figure 3. Storage Registration Page

Screen Shot 2013-06-09 at 10.15.04 PM

Figure 4. Choose the type of storage array

Once on the Storage Registration page, choose “Register” and then choose the storage, either Netapp or Sun ZFS.

Register the Storage

Screen Shot 2013-06-09 at 10.17.02 PM

Figure 5. Storage Registration Page

To register the storage supply the following information

  • Name: Storage array name registered in DNS
  • Vendor
  • Protocol: http or https
  • Storage Credentials: credentials for interacting with storage

All of which is documented in cloud administration guide.

Define agents used to manage storage

Then define agents used to manage storage. Agents have are required to run on a LINUX host.  More than one agen can define to provide redundancy.  The agents will be the path by which OEM communicates with the storage. For each agent, supply the following information

  • Host name
  • Credential type
  • Credentials

And finally define the frequency with which the agent synchronizes with the storage to gather the sorage hardware details such as  information on aggregates shares volumes.

After the storage information, agent information and agent synchronization information has been filled out, then hit “submit” button in the top right. Hitting the submit button will return the UI back to the “Storage Registration”. On the “Storage Registration”, click on the storage appliance listed in the top, then click on the contents tab on the bottom half of the page. This will list all the  volumns and aggregates in the storage appliance.

Looking at volumns on Storage Array

Screen Shot 2013-06-09 at 10.18.48 PM

Figure 6. Editing storage ceiling by clicking on a aggregate and hitting the “Edit Storage Config” tab.

For each aggregate one can set storage ceilings. Click on the aggregate or FlexVol and the click “Edit Storage Ceilings” tab.

Choosing a Database Test Master

On the database tab is a list of databases that can be used for cloning. OEM detects the database automatically on the hosts it is managing. OEM will also automatically correlate databases that have storage on the storage array added storage registration.  OEM looks for all databases that have files on the registered storage.  Click on database, then the show files tabs which will show the files and volumes for this database.

Screen Shot 2013-06-09 at 10.20.00 PM

Figure 7. List of files by volumn for database

Screen Shot 2013-06-09 at 10.20.07 PM

Figure 8. Enable Snap Clone for databases that will be used as test masters.

Nominating a database as test master requires enabling snap clone. To enable snap clone for a database, click on the chosen database, then click “Enable Snap Clone” tab just above the list of databases. This will automatically validate that all the volumes are flex clone enabled (in the case of Netapp).

 

Setting up Zones

The next step is to configure zone which can be used to organize cloud resources

Choose the menu option “Enterprise -> Cloud -> Midelware and Database Home”

 

Screen Shot 2013-06-09 at 10.23.11 PM

Figure 9. Navigate first to “Cloud -> Middleware and Database Home”

Middleware and Database Cloud page

Screen Shot 2013-06-09 at 10.24.23 PM

Figure 10.  Middleware and Databawe Cloud page

Setting up a Zone

In order to see the zones defined, click on the number next to the title “Paas Infrastructure Zones”  in the top left under General Information.

Screen Shot 2013-06-09 at 10.26.24 PM

Figure 11. PaaS Infrastructure Zones

To create a zone, click the tab “Create”.

Screen Shot 2013-06-09 at 10.27.33 PM

Figure 12. first page of wizard to create a PaaS Infrastructure Zone, give a meaningful name and description of the zone and define maximum CPU utilizaiton and memory allocation.

In the first page of the “PaaS Infrastructure Zone”, give zones a meaningful name and description. Define constraints such as maximum host CPU and memory allocations.

Screen Shot 2013-06-09 at 10.28.56 PM

Figure 13. Second page of the “PaaS Infrastructure Zone” wizard, add hosts that are available to the zone.

Next define hosts that are members of the zone and provide credentials that operate across all members of this zone

Screen Shot 2013-06-09 at 10.30.11 PM

Figure 14. Third page of the “PaaS Infrastructure Zone” wizard, limit which roles can see the zone.

Next define what roles can see and access this szone. The visibiliy of the zone can be limited to a certain class of users via roles like Dev, QA etc

 Screen Shot 2013-06-09 at 10.31.18 PM

Figure 15. Final review page for “PaaS Infrastructure Zone” wizard

Finally review settings and click submit

Screen Shot 2013-06-09 at 10.32.29 PM

Figure 16. Showing the Confirmation that the PaaS Infranstructure Zone has been successfully created.

 

Creating Database Pool and Profiles

The remaining steps required to enable snap clone is to create a database pools which is a collection of servers or nodes that have database software installed.  The remaining part of the setup is done by a differnet user who is the administrator for database as a service.

Log in as  DBAAS_ADMIN.

For the next part navigate to the menu “Setup -> Cloud -> Database”.

Screen Shot 2013-06-09 at 10.33.24 PM

Figure 17. Middleware and Database Cloud page

Screen Shot 2013-06-09 at 10.35.59 PM

Figure 18. Navigate to “Setup -> Cloud -> Database”.

Screen Shot 2013-06-09 at 10.36.56 PM

Figure 19. Database Cloud Self Service Portal Setup. To create a database pool choose the “Create” button in the center of the page and from the pull down, choose “For Database”.

To create a new pool click on the “Create” button in the center of the page, and chose “For Database” from the pull down menu that appears.

Screen Shot 2013-06-09 at 10.38.16 PM

 Figure 20. Choose “Create -> For Database”

Screen Shot 2013-06-09 at 10.39.10 PM

Figure 21. Edit pool page. Provide a meaningful name a descrpition of the pool. Add Oracle home directories in the bottom of the page. At the very bottom of the page set a constraint on the number of databases instances that can be created in the pool. On the top right, set the host credentials.

Set

  • Name and description
  • Oracle Home
  • Maximum number of databases per host
  • Credentials

In the “Edit Pool” page, at the top left of the screen, provide a meaningful name and description for the pool. In the middle of the screen add Oracle homes that will be used for databse provisioning.  Every member of a database pool is required to be homogeneous. Homogenous requires that the platform and Oracle version is the same across all the hosts  and Oracle homes in the pool. All the Oracle installations also have to be of the same type either single instance or RAC. In the top right  add the  Oracle home provide oracle credentials and root credentials. Finally at the bottom of the page a constraint can be set on the number of database instances that can be started in this pool.

Screen Shot 2013-06-09 at 10.41.03 PM

Figure 22. Set request limits on the pool

The next page sets the request settings. The first restriction sets how far in advanced can requrest can be made. Second restricts  how long a request can be kept which is the archive retension.  After the archive retention time  the  requests will be deleted.  Finally is the request duration which is the maximum duration for which the request can be made.

Screen Shot 2013-06-09 at 10.42.12 PM

Figure 23. Set memory  and storage quotas per role for the pool. The quotas cover memory, storage, database requests and schema requests.

 

The above page  configures quotas.  Quota is allocated to each and every self service user. The quotas controls the amount fo resources users  have access to. Quotas are assigned to a role and users inherit quota values from the role. Click “Create” in the middle of the screen.

Screen Shot 2013-06-09 at 10.43.03 PM

Figure 24. Editing the quotas on a pool for a  role.

The popup dialogue has for these entries

  • Role name
  • memory GB
  • storage GB
  • number of database request

Screen Shot 2013-06-09 at 10.43.56 PM

Figure 25. Profiles and Service Templates

Profiles and service templates

.A profile is use to capture information about the source database which can then be used for provisioning.

A service template is a standardized service definition for a database configuration that is offered to the self service users. A collection of service templates forms the  service catalogue. A service template will provision databsae with or  without seed data. To capture an ideal configuration, the easist thing to do is to point at an existing database and fetch information of interest from that database.  The information from the database can be captured using a profile.

To create a profile click on the “Create” button under “Profiles”

Creating a profile

 Screen Shot 2013-06-09 at 10.45.07 PM

Figure 26. Specify a reference target for a profile.

Click the magnifying glass to search for a source database.

Screen Shot 2013-06-09 at 10.46.46 PM

Figure 27. Search for a reference target database

 

Pick a refrence target by clicking on it, the click the “Select” button in bottom right.

 Screen Shot 2013-06-09 at 10.47.31 PM

Figure 28. Creating a Database Provisioning Profile

To pick a database for use in thin cloning, choose the check box “Data Content” with suboption selected fro “Structured Data” with sub-option selected for “Create” with sub-option selected for “Storage Snapshot”. This  option is only enabled only when the “enable snapshot” option is enabled on the storage registration page. Disable option capture oracle home.

Provide credentials for the host machines Oracle account and for the database login.

 

The “Content Option” step is not needed for the above selections.

Screen Shot 2013-06-09 at 10.48.16 PM

Figure 29. Give the profile a meaniful name and description

 

Next provide credentials for Oracle home and Oracle databse, then provide a meaningful name for the profile  as well as a location. The profile will be userful when creating a service template.

 Screen Shot 2013-06-09 at 10.49.12 PM

Figure 30. Review of create profile options.

 

Next  review the summary and click subit which will connect to storage and take snapshots of the storage


Screen Shot 2013-06-09 at 10.50.46 PM

 Figure 31. Shows a zoom into the menus to choose

 

To  create a new service template choose a profile and  in this case use “thin provisioning for reference DB” profile.  Now to create a new service template click “create” and choose “for database”. Service templates are part of the service catalogue and exposed to the self service users.

 Screen Shot 2013-06-09 at 10.51.42 PM

Figure 32. Provide a meaningful name and description for the service template.

 

Provide a meaningful name and description. For the rest of service template provide information about the databses that will be created from the snapshots such as providing database type, rac or single instance, for rac provide number of nodes. Provide the SID prefix to appended to the SIDs generated for the clones, provide the Domain Name and the port.

Screen Shot 2013-06-09 at 10.52.38 PM

Figure 33. Provide a storage area for writes to use.

 

The   cloning operation only creates a read only copy thus it is required to provide write space elsewhere in order to allow writing to the thin clone.

click on the edit button

click on volumne, then edit button

Screen Shot 2013-06-09 at 10.53.29 PM

Figure 34. set diretory and maximum space usage for the write location

 

Provide the mount point prefix and amount of writeable space wish to allocate

 

Users of the thin clone databses can  also be allowed to take further snapshots. These snapshots can be used as a mechinism to rollback changes, The number of thiese snapshtos can limited just below storage size section:

 Screen Shot 2013-06-09 at 10.54.18 PM

Figure 35. set the number of snapshots that can be taken of a a thin clone

Screen Shot 2013-06-09 at 10.55.15 PM

Figure 36. set the initial passwords for database accounts on the thin clone.

next provide credentials for administrative accounts

  • SYS
  • SYSMAN
  • DBSMNP

for all other non-administartive accounts can choose to leave them as is or change them all to one password you can modify certain init.ora parameters for exmaple memory

 Screen Shot 2013-06-09 at 10.55.59 PM

Figure 37. Modify any specific init.ora parameters for the thin clone

 

Screen Shot 2013-06-09 at 10.57.13 PM

Figure 38. Set any pre and post provision scripts to be run at the creation of a thin clone.

 

custom scripts can be provide as pre or post creation steps this can be very useful if you want to register databses with  OID or certian actions that are specific to your organization

 Screen Shot 2013-06-09 at 10.58.05 PM

Figure 39. Set the zone and pool for the thin clone

Screen Shot 2013-06-09 at 10.59.07 PM

Figure 40. set the roles that can use the service template.

you associate this srvice template with a zone and a template this insures that the service template can actualy work on the  group of resources that you have identified and can limit the visibility of the service tempalte usein roles

Screen Shot 2013-06-09 at 11.00.17 PM

Figure 41. review of the service template creation requites

finally we review the summary and click submit

Creating  a Thin Clone

Screen Shot 2013-06-09 at 11.06.14 PM

Figure 42. 12c Cloud Control

Screen Shot 2013-06-09 at 11.07.06 PM

Figure 43. 12c Cloud Control Self Service Portal

Contents of the Database Cloud Self Service Potal screen

  • Left hand side
    •  Notification – any instances that are about to expire
    •   Usage
      •     databases (number provisioned out of maximum)
      •     schema services
      •     Memory
      •     Storage
  • Right side
    •   Top
      •     Databases
    • Bottom
      •     requests – requests that created the database services and the database instances

Screen Shot 2013-06-09 at 11.08.02 PM

Figure 44. To clone a database, choose the “Request” then “Database” menu.

Screen Shot 2013-06-09 at 11.08.40 PM

Figure 45. From the list choose a Self Service Template. In this case “SOEDB Service Template”

Options are

  •   RMAN backups which are full clones
  •   empty databases
  •   snap clone which are thin clones

Screen Shot 2013-06-09 at 11.09.42 PM

Figure 46. Fill out the clone Service Request

request wizard asks for

  • request name
  • select zone – collection of servers
  • select a start and end time
  • provide a user name and password, new user and password

Users do not get system access to the databases but instead get a slightly less privilege user who becomes the owner of the database

Hit Submit

Screen Shot 2013-06-09 at 11.10.25 PM

Figure 47. Shows new clone database

 

References

http://www.youtube.com/watch?v=J7fnfLS5Dxg&feature=youtu.be - setup

http://www.youtube.com/watch?v=9VK1z6nU1PU – provisioning

 

SQL*Net Compression – 2

I wrote a note a few years ago about SQL*Net compression (this will open in a new window so that you can read the posts concurrently), showing how the order of the data returned by a query could affect the amount of network traffic. An example in the note demonstrated, using autotrace statistics that the number of bytes transferred could change dramatically if you sorted your return data set. At the time I asked, and postponed answering, the question: “but how come the number of SQL*Net round trips has not changed ?”

A couple of weeks ago someone asked me if I had ever got around to answering this question – and I hadn’t. So I started by writing an answering comment, then decided it was getting a little long so transferred it to a separate blog note, and here it is.

We can start with the two sets of stats:

Statistics
----------------------------------------------------------
          4  recursive calls
          0  db block gets
        143  consistent gets
          0  physical reads
          0  redo size
     425479  bytes sent via SQL*Net to client
        494  bytes received via SQL*Net from client
         11  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      10000  rows processed

Statistics
----------------------------------------------------------
          4  recursive calls
          0  db block gets
        133  consistent gets
          0  physical reads
          0  redo size
      79287  bytes sent via SQL*Net to client
        494  bytes received via SQL*Net from client
         11  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
      10000  rows processed

Basically, the answer comes from interpreting the names of the statistics correctly.

The “SQL*Net roundtrips to/from client” is the number of pairs of waits for “SQL*Net Message from client”/“SQL*Net Message to client” – which is controlled (in this example) very strongly by the array fetch size and the total number of rows (Note: we have arraysize = 1000 and rows returned = 10000 … the number of round trips is very close to 10,000 / 1,000.) This is exactly how the statistic is described in the SQL*Plus User’s Guide and Reference which says: “Total number of Oracle Net messages sent to and received from the client” – the trouble is, this description is easy to mis-understand.

If we enable extended SQL tracing (event 10046 / dbms_monitor) we would see that there is more activity going on than is reported by the autotrace statistics:
Here’s a sample of waits from the first query (retested on 11.2.0.2) with one blank line inserted for clarity:

WAIT #140199974133928: nam='SQL*Net message from client' ela= 319 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699605359
WAIT #140199974133928: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699605402
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 34 driver id=1650815232 #bytes=8155 p3=0 obj#=-40016366 tim=1370786699605478
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 4 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699605528
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699605565
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699605603
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699605640
FETCH #140199974133928:c=0,e=262,p=0,cr=7,cu=0,mis=0,r=1000,dep=0,og=1,plh=3617692013,tim=1370786699605654

WAIT #140199974133928: nam='SQL*Net message from client' ela= 319 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699605993
WAIT #140199974133928: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699606024
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 19 driver id=1650815232 #bytes=8155 p3=0 obj#=-40016366 tim=1370786699606082
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699606127
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699606163
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699606199
WAIT #140199974133928: nam='SQL*Net more data to client' ela= 3 driver id=1650815232 #bytes=8148 p3=0 obj#=-40016366 tim=1370786699606234
FETCH #140199974133928:c=0,e=230,p=0,cr=7,cu=0,mis=0,r=1000,dep=0,og=1,plh=3617692013,tim=1370786699606247

And here’s the equivalent on the 2nd query:

WAIT #140199974133928: nam='SQL*Net message from client' ela= 1187 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699689809
WAIT #140199974133928: nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699689893
FETCH #140199974133928:c=0,e=540,p=0,cr=0,cu=0,mis=0,r=1000,dep=0,og=1,plh=2148421099,tim=1370786699690421

WAIT #140199974133928: nam='SQL*Net message from client' ela= 1096 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699691585
WAIT #140199974133928: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=-40016366 tim=1370786699691666
FETCH #140199974133928:c=0,e=506,p=0,cr=0,cu=0,mis=0,r=999,dep=0,og=1,plh=2148421099,tim=1370786699692161

My session data unit (SDU) has defaulted to roughly 8KB and if you convert the driver id (1650815232) to hexadecimal you get 0×62657100, which you can recognise as BEQ, the local (bequeath) connection.

When there is little de-duplication available in the data order (first example) the 1,000 rows I fetch on each fetch call total about 48KB and I end up doing one “SQL*Net message to client” with five follow-up “SQL*Net more data to client”. (The bytes parameter for the “SQL*Net message to client” always shows 1 – Cary Millsap, I think, has pointed out that this is a long-standing error, so I’m assuming the call would be passing 8KB like all the other calls).

When I get maximum de-duplication in the data the entire 1,000 rows manages to compress down to less than the 8KB that’s available and all I see is the single “SQL*Net message to client”. with no calls for more data.

So – SQL*Net roundtrips isn’t measuring what you think – based on an informal reading of the description you and I might expect the first test to report far more round trips than the second, but that’s not the way that Oracle reports it, it’s literally counting the number of specific pairs of waits for “SQL*Net message to/from client” – the “more data” trips don’t get counted. As a result the same number of round-trips can report a hugely (and arbitrarily) different number of bytes.

Footnote:

I’ve only just noticed that I seem to be the only person that uses SQL*Plus. When I searched at tahiti.oracle.com to find the documented description of this statistics – I notice that the manual title was “SQL*Plus User’s Guide and Reference”; if there were more users it would be “SQL*Plus Users’ Guide and Reference” (check the apostrophe). Since I know that I use SQL*Plus that must mean no-one else does ;) I guess everyone else uses GUIs nowadays.

10053

I thought I’d try to spend some of today catching up on old comments – first the easier ones, then the outstanding questions on Oracle Core.
The very first one I looked at was about pushing predicates, and the specific comment prompted me to jot down this little note about the 10053 trace file (the CBO trace).

In “real-life” I don’t often look at 10053 trace files because they tend to be very long and messy and usually you can see all you need from an execution plan – perhaps through SQL Monitoring, or possible with the rowsource execution stats enabled. However there are questions that can sometimes be resolved very quickly by a simple text search of a trace file; questions of the form: “why isn’t the optimizer using feature X”, for example: “why didn’t the optimizer use predicate push for this view?” If you have a question of this type, then search the trace file for the word “bypass” – you may find comments like the following:

        JPPD:     JPPD bypassed: Outer query references right side of outer join
        JPPD:     JPPD bypassed: View has a single or group set function.
        JPPD:     JPPD bypassed: OLD_PUSH_PRED hint specified
        JPPD:     JPPD bypassed: User-defined operator
        JPPD:     JPPD bypassed: Push-down not enabled
        JPPD:     JPPD bypassed: View not on right-side of outer join
        JPPD:     JPPD bypassed: View contains a group by.
        JPPD:     JPPD bypassed: View contains a window function.
        JPPD:     JPPD bypassed: View contains a MODEL clause.
        JPPD:     JPPD bypassed: View contains a DISTINCT.
        JPPD:     JPPD bypassed: View contains a rownum reference.
        JPPD:     JPPD bypassed: START WITH query block.
        JPPD:     JPPD bypassed: View is a set query block.
        JPPD:     JPPD bypassed: Negative hint found
        JPPD:     JPPD bypassed: Outline does not contain hint
        JPPD:     JPPD bypassed: Base table missing statistics
        JPPD:     JPPD bypassed: Remote table
        JPPD:     JPPD bypassed: Security violation
        JPPD:     JPPD bypassed: Possible security violation.
        JPPD bypassed: Table level NO_PUSH_PRED hint.
        JPPD bypassed: Table-level hint.
        JPPD bypassed: Query block NO_PUSH_PRED hint.
        JPPD bypassed: View semijoined to table
        JPPD bypassed: Contained view has no table in fro list.
        JPPD bypassed: View contains a START WITH query block.
        JPPD bypassed: Does not contain a view.
        JPPD bypassed: Found branches of different types

According to a note that I’ve got with this list, by the way, I used “strings -a” on the Oracle executable for 10.2.0.1 to generate it. It’s probably about time I did the same for newer versions of Oracle.

Footnote:

My plans for addressing comments have been disrupted somewhat. Just as I published this note, an email holding two 10053 trace files arrived. (The author had asked before sending them, and I was sufficiently curious that I had agreed to take a quick look). So I’ve spent most of the last hour doing what I’ve just said I hardly ever to – looking at 10053 trace files.

The question was “why does this query run serially if I have a particular scalar subquery in the select list, but run parallel if I replace it with a function”.  The immediate answer, after I’d seen the query and thought about it for a bit, was: “because the manuals (10.2 – the relevant version) say that you don’t parallelize if you have scalar subqueries in the select list”; but this changed the question to: “why is it just this one scalar subquery that causes serialization when the other two scalar subqueries don’t”. Of the three scalar subqueries, only one of them cause the query to serialize.

The answer to that question is a little more subtle – and I’ll blog about it when I can find time to model the scenario.

Hyperion (Dan Simmons)…

Hyperion is the first in the Hyperion Cantos series by Dan Simmons.

What a great Sci-Fi book! A group of seven travellers are on a pilgrimage to Hyperion. Six of the seven tell the stories of how they came to be there, with the sixth story kind-of linking things together. There is no real conclusion to the story as the next book carries on the story from the point the first one ends. It was definitely written as a series!

The timeline jumps around quite a bit through the book, but in a good way. It’s not done in a confusing way.

Definitely worth a look for any Sci-Fi readers out there.

Cheers

Tim…

PS. I put the authors name in the title, for fear of confusing people into thinking this was about work. :)


Hyperion (Dan Simmons)… was first posted on June 8, 2013 at 5:25 pm.
©2012 "The ORACLE-BASE Blog". Use of this feed is for personal non-commercial use only. If you are not reading this article in your feed reader, then the site is guilty of copyright infringement.

can you lock part of a table ?

I should stress that this is all well covered in the documentation, but it was something I stumbled upon just recently.

Its common knowledge that you can lock a table:

SQL> create table T 
  2   ( x int ) 
  3  partition by range ( x ) 
  4  ( 
  5    partition p1 values less than (10), 
  6    partition p2 values less than (20), 
  7    partition p3 values less than (30) 
  8  );

Table created.

SQL> lock table T in exclusive mode;

Table(s) Locked.

However, you can also selectively lock just partitions of a table as well, using the standard partition clause, for example:

Session 1

SQL> lock table T partition ( p1 ) in exclusive mode;

Table(s) Locked.

whilst the following is still possible in session 2:

SQL> insert into T values (20);

1 row created.

Demystifying Big Data for Oracle Professionals

Ever wonder about Big Data and what exactly it means, especially if you are already an Oracle Database professional? Or, do you get lost in the jargon warfare that spews out terms like Hadoop, Map/Reduce and HDFS? In this post I will attempt to explain these terms from the perspective of a traditional database practitioner but getting wet on the Big Data issues by being thrown in the water. I was inspired to write this from the recent incident involving NSA scooping up Verizon call data invading privacy of the citizens.

It's not a news anymore that many people are reacting in different ways to the news of National Security Agency accessing the phone records of Verizon - cellphone and land lines - to learn who called who. But one thing is common - all these reactions are pretty intense. While the debate lingers on whether the government overstepped its boundaries in accessing the records of private citizens or whether it was perfectly justified in the context of the threats our country is facing right now - and there will be a debate for some time to come - I had a different thought of my own. I was wondering about the technological aspects of the case. The phone records must be massive. What kind of tools and technologies the folks in the "black room", a.k.a Room 641A must have used to collate and synthesize the records into meaningful intelligence - the very task NSA is supposed to gather and act upon? Along with anything it piques the interest from an engineering point of view.

And that brings the attention to the aspect of computation involving massive amounts of data. It's one thing to slice and dice a finite, reasonable amount of dataset; but in the case of the phone records, and especially collated with other records to identify criminal or pseudo-criminal activities such as financial records, travel records, etc., the traditional databases such as Oracle and DB2 likely will not scale well. But the challenge is not just in the realm of romanticized espionage; it's very much a concern for purely un-romantic corporations who, among other things, want to track customer behavior to fine tune their service and product offerings. Well, it's espionage of a slightly different kind. Perhaps the sources of data are different - website logs, Facebook feeds as opposed to phone records; but the challenges are the same - how to synthesize enormous amounts of seeming unrelated data into meaningful intelligence. This is the challenge of the "Big Data".

Meet the V's

Several years ago, Yahoo! faced the same issue - how to present the attractiveness the webpages it puts on its portal and analyze the pattern of clicking by users to attract advertisers to put relevant ads on their pages. Google had a similar challenge of indexing the entire World Wide Web in its servers so that it present search results very, very quickly. Both of these issues represent the issues that others probably didn't face earlier. Here are the relatively unique aspects of this data, which are known as the "Three V's of Big Data".

  • Volume - the sheer mass of the data made it difficult, if not impossible, to sort through them
  • Velocity - the data was highly transient. Website logs are relevant only for that time period; for a different period it was different
  • Variety - the data was not pre-defined and not quite structured - at least not the way we think of structure when we think of relational databases

Both these companies realized they are not going to address the challenges using the traditional relational databases, at least not in the scale they wanted. So, they developed tools and technologies to address these very concerns. They took a page from the super-computing paradigm of divide and conquer. Instead of dissecting the dataset as a whole, they divided it into smaller chunks to be processed by hundreds, even thousands of small servers. This approach solved three basic, crippling problems:

  1. There was no need to use large servers, which typically costs a lot more than small servers
  2. There was a built-in data redundancy since the data was replicated between these small servers
  3. But the most important, it could scale well, very well simply by adding more of those small servers

This is the fundamental concept that have rise to Hadoop. But before we cover that, we need to learn about another important concept.

Name=Value Pairs

A typical relational database works by logically arranging the data into rows and columns. Here is an example. You decide on a table design to hold your customers, named simply CUSTOMERS. It has the columns CUST_ID, NAME, ADDRESS, PHONE. Later, your organization decides to provide some incentives to the spouses as well and so you added another column - SPOUSE.

Everything was well, until the time you discovered customer 1 and spouse are divorced and there is a new spouse now. However the company decides to keep the names of the ex-spouses as well, for marketing analytics. Like the right relational application, you decide to break SPOUSE away from the main table and create a new table - SPOUSES, which is a child of CUSTOMERS, joined by CUST_ID. This requires massive code and database changes; but you survive. Later you had the same issue with addresses (people have different addresses - work, home, vacation, etc.) and phone numbers (cell phone, home phone, work phone, assistant's phone, etc.). So you decide to break them into different tables as well. Again, code and database changes. But the changes did not stop there. You had to add various tables to record hobbies, associates, weights, dates of birth - the list is endless. Every thing you record requires a database change and a code change. But worse - not all the tables will be populated for every customer. In your company's quest to build a 360 degree view of the customer, you collect some information; but there is no guarantee that all the data points will be gathered. you are left with sparse tables. Now, suddenly, someone says there is yet another attribute required for the customer - professional associations. So, off you go - build yet another table, followed by change control, code changes to incorporate that.

If you look at the scenario above, you will find that the real issue is trying force a structure around a dataset that inherently unstructured - akin to a square peg in a round hole. The lack of structure of the data is what makes it agile and useful; but the lack of structure is also what makes it difficult in a relational database that demands structure. This is the primary issue if you wan tto capture social media data - Twitter feeds, Facebook updates, LinkedIn updates and Pinterest posts. It's impossible to predict in advance, at least accurately, the exact information you will expect to see in them. So, putting a structure around the data storage not only makes life difficult for everyone - the DBAs will constantly need to alter the structures and the developers/designers will constantly wait for the structure to be in the form they want - slowing down capture and analysis of data.

So, what is the solution. If you think about it, think about how we - human beings - process information. Do we parse information in form of rows in some table? Hardly. We process and store information by associations. For instance, let's say I have a friend John. I probably have nuggets of information like this:

#fff2cc;">Last Name = Smith
#fff2cc;">Lives at = 13 Main St, Anytown, USA
#fff2cc;">Age = 40
#fff2cc;">Birth Day = June 7th
#fff2cc;">Wife = Jane
#fff2cc;">Child = Jill
#fff2cc;">Jill goes to school = Top Notch Academy
#fff2cc;">Jill is in Grade = 3

... and so on. Suppose I meet another person - Martha- who tells me that her child also goes to Grade 3 in Top Notch Academy. My brain probably goes through a sequence like this:

#fce5cd;">Search for "Top Notch Academy"
#fce5cd;"> Found it. It's Jill
#fce5cd;"> Search for Jill.
#fce5cd;">   Found it. She is child of John
#fce5cd;">   Who is John's wife?
#fce5cd;">   Found it. It's Jane.
#fce5cd;">     Where do John and Jill live? ...

And finally, after this processing is all over, I say to Martha as a part of the coversation - "What a coincidence! Jill - the daughter of my friends John and Jane Smith goes there are well. Do you know them?". "Yes, I do," replies Martha. "In fact they are in the same class, that of Mrs Gillen-Heller".

Immediately my brain processed this new piece of information and filed the data as:

#fff2cc;">Jill's Teacher = Mrs. Gillen-Heller
#fff2cc;">Jane's Friend = Martha
#fff2cc;">Martha's Child Goes to = ...

Months later, I meet with Jane and mention to her that I met Martha whose child went to Mrs. Gillen-Heller's class - the same one as Jill. "Glad you met Martha,", Jane says. "Oh, Jill is no longer in that class. Now she is in Mr. Fallmeister's class."

Aha! My brain probably stored that information as:

#fff2cc;">Jill's former teacher = Mrs. Gillen-Heller

And it updated the already stored information:

#fff2cc;">Jill's Teacher = Mr. Fallmeister

This is called storing by a name=value pair. You see I stored the information as a pair of property and it value. As information goes on, I keep adding more and more pairs. When I need to retrieve information, I just get the proper property and by associations, I get all the data I need. But the storing of data by name=value pairs gives me enormous flexibility in storing all kinds of information without modifying any data structures I may currently have.

This is also how the Big Data is tamed for processing. Since the data coming of Twitter, Facebook, LinkedIn, Pinterest, etc. is impossible to categorize in advance, it will be practically impossible to put it all in the relational format. Therefore, a name=value pair type storage is the logical step in compiling and collating the data. the name is also known as "key"; so the model is sometimes called key-value pair. The value doesn't have to have a datatype. In fact it's probably a BLOB; so anything can go in there - booking amount, birth dates, comments, XML documents, pictures, audio and even movies. It provides an immense flexibility in capturing the information that is inherently unstructured.

NoSQL Database

Now that you know about name value pairs, the next logical question you may have is - how do we store these? Our thoughts about databases are typically colored by our long-standing association with relational databases, making them almost synonymous with the concept of database. Before relational databases were there, even as a concept, big machines called mainframes ruled the earth. The databases inside them were stored in hierarchical format. One such database from IBM was IMS/DB, which was hierarchical. Later, when relational databases were up and coming, another type of database concept - called a network database - was developed to compete against it. An example of that category was IDMS (now owned by Computer Associates) developed for mainframes. The point is, relational databases were not the answer to all the questions then; and it is clear that they are not now either.

This leads to the development a different type of database technologies based on the the key-value model. Relational database systems are queried by SQL language, which I am sure is familiar to almost anyone reading this blog post. SQL is a set-oriented language - it operates on sets of data. In the key-value pair mode, however, that does not work anymore. Therefore these key-value databases are usually known as NoSQL, to separate them from the relational SQL-based counterparts. Since their introduction, some NoSQL databases actually support SQL, which is why "NoSQL" is not a correct term anymore. Therefore sometimes it is referred to as "Not only SQL" databases. But the point is that their structure is not dependent on relational. But how exactly the data is stored is usually left to the implementer. Some examples are MongoDB, Dynamo, Big Table (from Google) etc.

I would stress here that almost any type of non-relational database can be classified as NoSQL; not just the name-value pair models. For instance, Object Store, an object database is also NoSQL. But for this blog post, I am assuming only key-value pair database as the NoSQL one.

Map/Reduce

Let's summarize what we have learned so far:

  1. The key-value pair model in databases offer flexibility in data storage without the need for a predefined table structure
  2. The data can be distributed across many machines where they are independently processed and then collated.

When the system gets a large chunk of data, e.g. a Facebook feed, the first task is to break it down to these keys and their corresponding values. After that the values may be collated for a summary result. The process of dividing the raw data into meaningful key-value pairs is known as "mapping". Later combining the values to form summaries, or just eliminating the noise from the data to extract meaningful information is known as "reducing". For instance you see "Name" and "Customer Name" in the keys. They mean the same thing; so you reduce them to a single key value - "Name". These are almost always used together; hence the operation is known as Map/Reduce.

Here is a very rudimentary but practical example of Map/Reduce. Suppose you get Facebook feeds and you are expected to find out the total of likes for our company's recent post. Facebook feed comes in the form of a massive dataset. The first task is to divide that among many servers - a principle described earlier to make the process scale well. Once the dataset is divided, each machine run some code to extract and collate the information and then present the data to some central coordinator. to collate for the final time. Here is a pseudo-code for the process for each server doing the processing on a subset of data:

begin
  get post
  while (there_are_remaining_posts) loop
     extract status of "like" for the specific post
     if status = "like" then 
         like_count := like_count + 1
     else
         no_comment := no_comment + 1
     end if
  end loop
end

Let's name this program counter(). Counter runs on all the servers, which are called Nodes. As shown in the figure, there are three nodes. The raw dataset is  divided into three sub-datasets which are then fed to each of the three Nodes. A copy of the subdataset is kept another server as well. That takes care of redundancy. Each node perform their computation, send their results to an intermediate result set where they are collated.

Map/Reduce Processing

How does this help? It does; in many ways. Let's see:

(1) First, since the data is stored in chunks and the copy of a chunk is stored in a different node, there is built-in redundancy. There is no need to protect the data being fed since there is a copy available elsewhere.
(2) Second, since the data is available elsewhere, if a node fails, all it needs to done is that some other nodes will pick up the slack. There is no need to reshuffle or restart the job.
(3) Third, since the nodes all perform task independently, when the datasize becomes larger, all you have to do is to add a new node. Now the data will be divided four ways instead of three and so will be processing load.

This is very similar to parallel query processes in Oracle Databases, with PQ servers being analogous to nodes.

There are two very important points to note here:

(1) The subset of data each node gets is not needed to be viewed by all the nodes. Each node gets its own set of data to be processed. A copy of the subset is maintained in a different node - making the simultaneous access to the data unnecessary. This means you can have the data in a local storage; not in expensive SANs. This is not only brings cost significantly down but may performs better as well due to a local access. As cost of Solid State Devices and flash-based storage plummets, it could also mean the that the storage cost per performance will be even better.
(2) The nodes need not be super fast. A relatively simple commodity class server is enough for the processing as opposed to a large server. Typically servers are priced for their use, e.g. a Enterprise class server with 32 CPUs is probably roughly equivalent in performance to eight 4-CPU blades. But the cost of the former is way more than 8 times the cost of the blade server. This model takes advantage of the cheaper computers by scaling horizontally; not vertically.

Hadoop

Now that you know how the processing data in parallel and using a concept called Map/Reduce allows you to shove in several compute intensive applications to dissect large amounts of data, you will often wonder - there are a lot of moving parts to be taken care of just to empower this process. In a monolithic server environment you just have to kick off multiple copies of the program. The Operating System does the job of scheduling these programs on the available CPUs, taking them off the CPU (paging) to roll in another process, prevent processes from corrupting each others' memory, etc. Now that these processes are occurring on multiple computers, there has to be all these manual processes to make sure they work. For instance, in this model you have to ensure that the jobs are split between the nodes reasonably equally, the dataset is split equitably, the queue for feeding data and getting data back from the Map/Reduce jobs are properly maintained, the jobs fail over in case of node failure, so on. In short, you need an operating system of operating systems to manage all these nodes as a monolithic processor.

What would you do if this operating procedures were already defined for you? Well, that would make things really easy, won't it? you can then focus on what you are good at - developing the procedures to slice and dice the data and derive intelligence from it. This "procedure", or the framework is available, and it is called Hadoop. It's an open source offering, similar to Mozilla and Linux; no single company has exclusive ownership of it. However, many companies have adopted it and evolved it into their offerings, similar to Linux distributions such as Red Hat, SuSe and Oracle Enterprise Linux. Some of those companies are Cloudera, Hortonworks, IBM, etc. Oracle does not have a Hadoop distribution. Instead it licenses the one from Cloudera for its own Big Data Appliance. The Hadoop framework runs on all the nodes of the cluster of nodes and acts as the coordinator.

A very important point to note here that Hadoop is just a framework; not the actual program that performs Map/Reduce. Compare that to the operating system analogy; an OS like Windows does not offer a spreadsheet. You will need to either develop or buy an off the shelf product such as Excel to have that functionality. Similarly, Hadoop offers a platform to run the Map/Reduce programs that you develop and you put that logic in the code what you "map" and how you "reduce".

Remember another important advantage you had seen in this model earlier - the ability to replicate data between multiple nodes so that the failure of a single node does not cause the processing to be abandoned. This is offered through a new type of filesystem called Hadoop Distributed Filesystem (HDFS). HDFS, which is a distributed (and not a clustered) filesystem, by default has 3 copies of data on three different nodes - two on the same rack and the third on a different rack. The nodes communicate to each other using a HDFS- specific protocol that is built on TCP/IP. The nodes are aware of the data present on the other nodes, which is precisely what allows Hadoop job scheduler to divide the work among the nodes. Oh, by the way, HDFS is not absolutely required for Hadoop; but as you can see, HDFS is the only way for Hadoop to know which node has what data for smart job scheduling. Without it, the division of labor will not be as efficient.

Hive

Now that you learned how Hadoop fills a major void for computations on a massive dataset, you can't help but see the significance for datawarehouses where massive datasets are common. Also common are jobs that churn through this data. However, there is a little challenge. Remember, NoSQL databases mentioned earlier? That means they do not support SQL. To get the data you have to write a program using the APIs the vendor supplies. This may reek of COBOL programs of yesteryear where you had to write a program to get the data out, making it inefficient and highly programmer driven (although it did strut up job-security, especially during the Y2K transition times). The inefficacy of the system gave rise to 4th Generation Languages like SQL, which brought the power of queries to common users, ripping the power away from programers. In other words, it brought the data and its users closer, reducing the role the middleman significantly. In datawarehouses, it was especially true since the power users issued queries to after getting the result from the previous queries. It was like a conversation - ask a question, get the answer, formulate your next question - and so on. If the conversation were dependent on writing programs, it would have been impossible to be effective.

With that in mind, consider the implications of lack of SQL in these databases highly suitable for datawarehouses. This requirement to write a program to get the data everytime would have taken it straight to the COBOL-days. Well, not to worry, Hadoop has another product that allows a SQL-like language called HiveQL. Just as users could query relational databases with SQL very quickly, HiveQL allows users to get the data for analytical processing directly. It was initially developed at Facebook.

Comparing to Oracle RAC

When we talk about clustering in Hadoop, you may not help wonder - shouldn't the same functionality be provided by Oracle Real Application Cluster? Well, the short answer is - a big resounding NO. Oracle RAC combines the power of multiple nodes in the cluster which communicate with one another and transfer data (cache fusion); but that's where the similarity ends. The biggest difference is the way the datasets are accessed. In RAC, the datasets are common, i.e. they must be visible to all the nodes of the cluster. In Hadoop, the datasets are specific to the individual nodes, which allows them to be local. The filesystems in RAC can't be local; they have to be clustered or available globally, either by a clustered filesystem, shared volumes, clustered volume managers (such as ASM)  or by NFS mounting. In Hadoop the local files are replicated to other nodes, which means there is no reason to create a RAID lever protection at the storage level. ASM does provide a software level mirroring, which may sound similar to Hadoop's replication; but remember, ASM's mirrors are not node specific. The mirrored copies of ASM must be visible to all the nodes. There is a preferred node concept in ASM; but that simply means that data is read by a specific node from one mirror copy. The mirror copies can't be local; they must be be globally visible.

Besides, Oracle RAC if for a relational database. The Hadoop cluster is not one. There are traits such as transnational integrity, multiple concurrent writes that are innate features of any modern database system. In Hadoop, these things are not present and not really necessarily. So while both are technically databases, a comparison may not be fair to either Hadoop or RAC. They are like apples and tomatoes. (Tomato is technically a fruit, just in case you are wondering about this analogy).

The Players

So, who are the players for this new branch of data processing? The major players are show below with small description. This list is by no means exhaustive. It simply is a collection of companies and products I have studied.

  • Cloudera - they have their distribution called, what else, Cloudera Distribution for Hadoop (CDH). But perhaps the most impressive from them is Impala - a real-time SQL like interface to query data from the Hadoop cluster.
  • Hortonworks - may of the folks who founded this company came from Google and Yahoo! where they built or added to the building blocks of Hadoop
  • IBM - they have a suite called Big Insights which has their distribution of Hadoop. This is one of the very few companies who offer both the hardware and software. the most impressive feature from IBM is a product called Streams that can mine data frm a non-structured stream like Facebook in realtime and send alerts and data feeds to other systems.
  • Dell - like IBM they also have a hardware/software combination running a licensed version of Cloudera, along with Pentaho Business Analytics
  • EMC
  • MapR

Conclusion

If the buzzing of the buzz-words surrounding any new technology annoy you and all you get is tons of websites on the topic but not a small consolidated compilation of terms, you are just like me. I was frustrated by the lack of information in a digestible form on these buzzwords that are too important to ignore but would take too much time to understand fully. This is my small effort to bridge that gap and get you going on your quest for more information. If you have 100's or 1000's of questions after reading this, I would congratulate myself - that is precisely what my objective was. For instance how HiveQL differs from SQL or how Map/Reduce jobs are written - these are questions that should be flying in your mind now. My future posts will cover them and some more topics like HBase, Zookeeper, etc. that will unravel the mysteries around the technology that is going to be commonplace in the very near future.

Welcome aboard. I wish you luck in learning. As always, your feedback will be highly appreciated.

DataGuard FastStart Failover configuration with Oracle Wallet

Last two months were very busy for me with lot of work and moving house as well. There is a lot of thoughts in my head I want to write about but some of them have to wait until my lab will be online again.Recently I was working on Oracle DataGuard configuration with FastStart Failover and this is what I want to share with you today.

FastStart Failover is a good solution for automatic promoting standby database to be primary and allow application to keep running in case of primary database failure. I don't want to describe whole solution here but want to focus on Observer problems I had recently.

Observer is a part of DataGuard Broker running on 3rd server and it is used to prevent a brain split between primary and standby database. Observer is started by DGMGRL program and it required a sys user credentials. Customer doesn't want to keep sys password in scripts to start and stop observer so Oracle Wallet looks like a good solution to address this issue.

Configuration overview:
- database name - TESTDB
- unique database names - TESTDB_PRIMARY and TESTDB_STANDBY

Wallet has been created and sys user with alias for primary and standby database has been added to it.
I was able to connect to both servers:

[oracle@orasvr3 ~]$ mkstore -wrl /home/oracle/testwallet -listCredential
Oracle Secret Store Tool : Version 11.2.0.3.0 - Production
Copyright (c) 2004, 2011, Oracle and/or its affiliates. All rights reserved.
Enter wallet password:
List credential (index: connect_string username)
2: TESTDB_STANDBY sys
1: TESTDB_PRIMARY sys
[oracle@orasvr3 ~]$ dgmgrl
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
Copyright (c) 2000, 2009, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect /@TESTDB_PRIMARY
Connected.
DGMGRL> connect /@TESTDB_STANDBY
Connected.
DGMGRL>

Observer has been started and I begun failover tests. Here is a first attempt:

12:09:25.21  Thursday, May 30, 2013
Initiating Fast-Start Failover to database "TESTDB_PRIMARY"...
Performing failover NOW, please wait...
Failover succeeded, new primary is "TESTDB_PRIMARY"
12:09:31.78  Thursday, May 30, 2013

12:10:47.91  Thursday, May 30, 2013
Initiating reinstatement for database "TESTDB_STANDBY"...
Reinstating database "TESTDB_STANDBY", please wait...
Operation requires shutdown of instance "TESTDB" on database "TESTDB_STANDBY"
Shutting down instance "TESTDB"...
ORA-01031: insufficient privileges

Warning: You are no longer connected to ORACLE.

Please complete the following steps and reissue the REINSTATE command:
        shut down instance "TESTDB" of database "TESTDB_STANDBY"
        start up and mount instance "TESTDB" of database "TESTDB_STANDBY"

12:11:06.36  Thursday, May 30, 2013

Looks like Observer was unable to shutdown TESTDB_STANDBY database. This is typical error if there is no DGMGRL services registered in listeners but I was sure network configuration is fine as it was working perfectly fine when observer has been started with user and password in connection string. So this problem has to be related to Oracle Wallet only. There is a not well know switch for DGMGRL program to enable debug mode and have some more information about issue and I decided to use it. First of all I have started with switchover tests as there need this same interaction with starting and stopping instance like failover and are faster to perform.

Here is a output with debug option enabled:

[oracle@orasvr3 ~]$ dgmgrl -debug
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production

Copyright (c) 2000, 2009, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect /@TESTDB_PRIMARY
[W000 05/30 12:25:02.04] Connecting to database using TESTDB_PRIMARY.
[W000 05/30 12:25:02.09] Checking broker version [BEGIN :version := dbms_drs.dg_broker_info('VERSION'); END;].
[W000 05/30 12:25:02.10] Broker version is '11.2.0.3.0'
Connected.

DGMGRL> switchover to TESTDB_PRIMARY;
Performing switchover NOW, please wait...
New primary database "TESTDB_PRIMARY" is opening...
Operation requires shutdown of instance "TESTDB" on database "TESTDB_STANDBY"
Shutting down instance "TESTDB"...
[W000 05/30 12:26:23.24] Connecting to database using (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGB)(INSTANCE_NAME=TESTDB)(SERVER=dedicated))).
ORA-01031: insufficient privileges

Warning: You are no longer connected to ORACLE.

Please complete the following steps to finish switchover:
        shut down instance "TESTDB" of database "TESTDB_STANDBY"
        start up instance "TESTDB" of database "TESTDB_STANDBY"

  It is trying to connect to TESTDB_STANDBY using following connection description

(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGB)(INSTANCE_NAME=TESTDB)(SERVER=dedicated)))

but this one doesn't exist in wallet. Service _DBG is created by DMON process and it is registered in default listener or listeners specified in local_listener parameter.
I have added above connection string to Oracle Wallet and completed all manual steps before next try. Here is a output for second switchover run:

[oracle@orasvr3 ~]$ dgmgrl -debug
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
Copyright (c) 2000, 2009, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect /@TESTDB_PRIMARY
[W000 06/06 12:50:31.01] Connecting to database using TESTDB_PRIMARY.
[W000 06/06 12:50:31.05] Checking broker version [BEGIN :version := dbms_drs.dg_broker_info('VERSION'); END;].
[W000 06/06 12:50:31.06] Broker version is '11.2.0.3.0'
Connected.
DGMGRL> switchover to TESTDB_PRIMARY;
Performing switchover NOW, please wait...
New primary database "TESTDB_PRIMARY" is opening...
Operation requires shutdown of instance "TESTDB" on database "TESTDB_STANDBY"
Shutting down instance "TESTDB"...
[W000 06/06 12:50:45.95] Connecting to database using (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGB)(INSTANCE_NAME=TESTDB)(SERVER=dedicated))).
ORACLE instance shut down.
Operation requires startup of instance "TESTDB" on database "TESTDB_STANDBY"
Starting instance "TESTDB"...
[W000 06/06 12:50:47.09] Connecting to database using (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))).
ORA-01031: insufficient privileges
Warning: You are no longer connected to ORACLE.
Please complete the following steps to finish switchover:
        start up and mount instance "TESTDB" of database "TESTDB_STANDBY"

Still there is manual step to do but this time database was shut down automatically, and Observer was unable to connect to start up database. There is other connection string

(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))).

This one can be found here

DGMGRL> show database verbose TESTDB_STANDBY;

Database - TESTDB_STANDBY

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    TESTDB

  Properties:
    DGConnectIdentifier             = 'TESTDB_STANDBY'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'SYNC'
    DelayMins                       = '0'
    Binding                         = 'optional'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '10'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = ''
    LogFileNameConvert              = 'a, a'
    FastStartFailoverTarget         = 'TESTDB_PRIMARY'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    SidName                         = 'TESTDB'
    StaticConnectIdentifier         = '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED)))'
    StandbyArchiveLocation          = '/oralogs/arch/TESTDB'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = 'arch_%t_%s_%r.arc'
    TopWaitEvents                   = '(monitor)'

Database Status:
SUCCESS

It has been added to Oracle Wallet as well and now my wallet looks like this

[oracle@orasvr3 ~]$ mkstore -wrl /home/oracle/testwallet -listCredential
Oracle Secret Store Tool : Version 11.2.0.3.0 - Production
Copyright (c) 2004, 2011, Oracle and/or its affiliates. All rights reserved.
Enter wallet password:
List credential (index: connect_string username)
6: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))) sys
5: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr1)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_PRIMARY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))) sys
4: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGB)(INSTANCE_NAME=TESTDB)(SERVER=dedicated))) sys
3: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr1)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=TESTDB_PRIMARY_DGB)(INSTANCE_NAME=TESTDB)(SERVER=dedicated))) sys
2: TESTDB_STANDBY sys
1: TESTDB_PRIMARY sys

After manual completion of required steps I run switchover again and this time it completed without any errors

DGMGRL> switchover to TESTDB_PRIMARY;
Performing switchover NOW, please wait...
New primary database "TESTDB_PRIMARY" is opening...
Operation requires shutdown of instance "TESTDB" on database "TESTDB_STANDBY"
Shutting down instance "TESTDB"...
[W000 05/30 12:36:51.39] Connecting to database using (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGB)(INSTANCE_NAME=TESTDB)(SERVER=dedicated))).
ORACLE instance shut down.
Operation requires startup of instance "TESTDB" on database "TESTDB_STANDBY"
Starting instance "TESTDB"...
[W000 05/30 12:36:52.54] Connecting to database using (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))).
[W000 05/30 12:36:52.58] Checking broker version [BEGIN :version := dbms_drs.dg_broker_info('VERSION'); END;].
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 363 Serial number: 5

ORACLE instance started.
[W000 05/30 12:36:54.81] Connecting to database using (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))).
[W000 05/30 12:36:54.84] Checking broker version [BEGIN :version := dbms_drs.dg_broker_info('VERSION'); END;].
[W000 05/30 12:36:54.84] Broker version is '11.2.0.3.0'
alter database  mount

Database mounted.
[W000 05/30 12:37:02.74] Connecting to database using (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orasvr2)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=TESTDB_STANDBY_DGMGRL)(INSTANCE_NAME=TESTDB)(SERVER=DEDICATED))).
[W000 05/30 12:37:02.77] Checking broker version [BEGIN :version := dbms_drs.dg_broker_info('VERSION'); END;].
[W000 05/30 12:37:02.78] Broker version is '11.2.0.3.0'
Switchover succeeded, new primary is "TESTDB_PRIMARY"
DGMGRL>

Lesson learned:

  • FSFO Observer can work with Oracle Wallet
  • 3 entries are required per database, service name has to be set to : , and
  • DGMGRL entry can be checked in database configuration in Broker
  • DBG entry - use -debug mode if you can't find a proper connection string 

Hope it will help you with DataGuard Observer configuration.

regards,
Marcin

Same Plan

An interesting little problem appeared on the Oracle-L mailing list earlier on this week – a query ran fairly quickly when statistics hadn’t been collected on the tables, but then ran rather slowly after stats collection even though the plan hadn’t changed, and the tkprof results were there to prove the point. Here are the two outputs (edited slightly for width – the original showed three sets of row stats, the 1st, avg and max, but since the query had only been run once the three columns showed the same results in each case):


 Rows (max)  Row Source Operation
 ----------  ---------------------------------------------------
          0  UPDATE  CXT_FAKT_PROVISIONSBUCHUNG (cr=2039813 pr=3010 pw=0 time=47745718 us)
      15456   TABLE ACCESS FULL CXT_FAKT_PROVISIONSBUCHUNG (cr=1328 pr=1325 pw=0 time=40734 us cost=370 size=880992 card=15456)
      11225   VIEW  CXV_HAUPT_VU_SPARTE (cr=2038477 pr=1684 pw=0 time=47297497 us cost=10 size=4293 card=1)
      11225    SORT UNIQUE (cr=2038477 pr=1684 pw=0 time=47284436 us cost=10 size=824 card=1)
     126457     VIEW  (cr=2038167 pr=1669 pw=0 time=27753402 us cost=9 size=824 card=1)
     126457      WINDOW SORT (cr=2038167 pr=1669 pw=0 time=27667835 us cost=9 size=853 card=1)
     126457       WINDOW SORT (cr=2038167 pr=1669 pw=0 time=26884699 us cost=9 size=853 card=1)
     126457        NESTED LOOPS  (cr=2038167 pr=1669 pw=0 time=26342292 us)
     126457         NESTED LOOPS  (cr=1995241 pr=1615 pw=0 time=26192173 us cost=7 size=853 card=1)
     141581          NESTED LOOPS  (cr=642683 pr=1066 pw=0 time=3039331 us cost=5 size=499 card=1)
      11225           TABLE ACCESS BY INDEX ROWID POP_INFO (cr=22473 pr=32 pw=0 time=109767 us cost=2 size=16 card=1)
      11225            INDEX UNIQUE SCAN PK_POP_INFO (cr=11248 pr=4 pw=0 time=51796 us cost=1 size=0 card=1)(object id 1790009)
     141581           TABLE ACCESS BY INDEX ROWID TMP_VU_SPARTE (cr=620210 pr=1034 pw=0 time=2889850 us cost=3 size=483 card=1)
    1732978            INDEX RANGE SCAN IDX_TMP_VU_SPARTE (cr=140952 pr=204 pw=0 time=2094982 us cost=2 size=0 card=1)(object id 1795724)
     126457          INDEX RANGE SCAN IDX_TMP_VU_SPARTE (cr=1352558 pr=549 pw=0 time=23078816 us cost=1 size=0 card=1)(object id 1795724)
     126457         TABLE ACCESS BY INDEX ROWID TMP_VU_SPARTE (cr=42926 pr=54 pw=0 time=94791 us cost=2 size=354 card=1)

 Rows (max)  Row Source Operation
 ----------  ---------------------------------------------------
          0  UPDATE  CXT_FAKT_PROVISIONSBUCHUNG (cr=89894995 pr=1701 pw=0 time=318766975 us)
      15456   TABLE ACCESS FULL CXT_FAKT_PROVISIONSBUCHUNG (cr=1328 pr=1031 pw=0 time=46975 us cost=370 size=880992 card=15456)
      11225   VIEW  CXV_HAUPT_VU_SPARTE (cr=89893656 pr=670 pw=0 time=1553653734 us cost=11 size=4293 card=1)
      11225    SORT UNIQUE (cr=89893656 pr=670 pw=0 time=1553640071 us cost=11 size=419 card=1)
     126457     VIEW  (cr=89893656 pr=670 pw=0 time=1533733864 us cost=10 size=419 card=1)
     126457      WINDOW SORT (cr=89893656 pr=670 pw=0 time=1533646166 us cost=10 size=155 card=1)
     126457       WINDOW SORT (cr=89893656 pr=670 pw=0 time=1532847028 us cost=10 size=155 card=1)
     126457        NESTED LOOPS  (cr=89893656 pr=670 pw=0 time=1532238656 us)
    3501658         NESTED LOOPS  (cr=89167767 pr=665 pw=0 time=1529652013 us cost=8 size=155 card=1)
    5300707          NESTED LOOPS  (cr=1339312 pr=480 pw=0 time=9657987 us cost=5 size=81 card=1)
      11225           TABLE ACCESS BY INDEX ROWID POP_INFO (cr=22473 pr=32 pw=0 time=119070 us cost=2 size=16 card=1)
      11225            INDEX UNIQUE SCAN PK_POP_INFO (cr=11248 pr=3 pw=0 time=54707 us cost=1 size=0 card=1)(object id 1790009)
    5300707           TABLE ACCESS BY INDEX ROWID TMP_VU_SPARTE (cr=1316839 pr=448 pw=0 time=8359987 us cost=3 size=65 card=1)
    5300707            INDEX RANGE SCAN IDX_TMP_VU_SPARTE (cr=140971 pr=87 pw=0 time=3603882 us cost=2 size=0 card=1)(object id 1795724)
    3501658          INDEX RANGE SCAN IDX_TMP_VU_SPARTE (cr=87828455 pr=185 pw=0 time=1518016475 us cost=2 size=0 card=1)(object id 1795724)
     126457         TABLE ACCESS BY INDEX ROWID TMP_VU_SPARTE (cr=725889 pr=5 pw=0 time=1829196 us cost=3 size=74 card=1)

As you can see, the first run took 48 seconds (time=47,745,718 us in the first line) while the second execution took 319 seconds (time=318766975 us). If you check the execution plans carefully they appear to be the same plan and, in fact, the trace file showed that the two plans had the same plan hash value. Clearly, though, they do vastly different amounts of work – the most eye-catching detail, perhaps, is the way the bad plan blows the row count up to 5 million before collapsing it back to 126,000). What do you have to do to get the change in performance (and it’s a totally reproducible change) – create or drop stats: if you have stats on the tables you get a slow execution, if you delete the stats you get the fast execution.

So what’s the problem ? Look carefully at the plan(s) – they’re not actually the same plan, but you can’t see the difference you can only see clues that they must be different. Notice in the last four lines that you access the same table (TMP_VU_SPARTE) twice using the same index (IDX_TMP_VU_SPARTE) – when you collect stats you access the two tables in the opposite order, and it makes a difference to the work you do.

To demonstrate the point I’ve created a simplified model of the problem, based on some extra information supplied in the mail thread. The model requires a correlated update, based on a view which joins a table to itself, and range-based predicates. Here’s the data generation:


execute dbms_random.seed(0)

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	trunc(dbms_random.value(0,1e4)) contract,
	sysdate - 300 + trunc(dbms_random.value(0,300))	date_from,
	sysdate - 300 + trunc(dbms_random.value(0,300))	date_to,
	sysdate - 300 + trunc(dbms_random.value(0,300))	created,
	sysdate - 300 + trunc(dbms_random.value(0,300))	replaced
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

create index t1_i1 on t1(contract, date_from, date_to);

create or replace view v1
as
select
	t1a.*
from
	t1	t1a,
	t1	t1b
where
	t1b.contract    = t1a.contract
and	t1b.date_from  <= t1a.date_from 
and	t1b.date_to    >  t1a.date_from
and	t1b.created    <  t1a.replaced 
and	t1b.replaced   >  t1a.created
;

rem
rem	We are going to update t2 from v1 using
rem	equality on all columns in the index
rem

create table t2
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	trunc(dbms_random.value(0,1e4)) 		contract,
	sysdate - 300 + trunc(dbms_random.value(0,300))	date_from,
	sysdate - 300 + trunc(dbms_random.value(0,300))	date_to,
	sysdate - 300 + trunc(dbms_random.value(0,300))	replaced
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5 ; 

begin 	dbms_stats.gather_table_stats(
 		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 =>'for all columns size 1'
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T2',
		method_opt	 =>'for all columns size 1'
	);

end;
/

I haven’t crafted my data particularly carefully and from the point of view of realism the content is a little bizarre (I’ve got “to” dates earlier than “from” dates, for example) – but all I’m interested in is getting the right sort of shape to demonstrate a point about the plan, I’m not trying to model the actual variation in activity.

I’ve created a view with a self-join that starts with equality on first column of the index I’ve created, but uses range-based predicates on the other columns in the table – that’s probably quite important as far as the real-world performance was concerned – partly because of the nature of the data (interesting skews when comparing “valid to/from dates”) and partly because Oracle is going to have to use its 5% guess for join selectivities for range based selectivities.

And now the update – two versions with their execution plans; starting with the plan where the correlated subquery visits t1 aliased as t1a first:

explain plan for
update t2 set
	replaced = (
		select
			/*+ leading(v1.t1a v1.t1b) */
			max(replaced)
		from	v1
		where
			v1.contract = t2.contract
		and	v1.date_from = t2.date_from
		and	v1.date_to = t2.date_to
	)
;

select * from table(dbms_xplan.display(null,null,'alias'));

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------
Plan hash value: 2328412027

-----------------------------------------------------------------------------------------
| Id  | Operation                       | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                |       |   100K|  2734K|   500K (20)| 00:41:41 |
|   1 |  UPDATE                         | T2    |       |       |            |          |
|   2 |   TABLE ACCESS FULL             | T2    |   100K|  2734K|    64   (8)| 00:00:01 |
|   3 |   SORT AGGREGATE                |       |     1 |    72 |            |          |
|   4 |    NESTED LOOPS                 |       |       |       |            |          |
|   5 |     NESTED LOOPS                |       |     1 |    72 |     4   (0)| 00:00:01 |
|   6 |      TABLE ACCESS BY INDEX ROWID| T1    |     1 |    36 |     2   (0)| 00:00:01 |
|*  7 |       INDEX RANGE SCAN          | T1_I1 |     1 |       |     1   (0)| 00:00:01 |
|*  8 |      INDEX RANGE SCAN           | T1_I1 |     1 |       |     1   (0)| 00:00:01 |
|*  9 |     TABLE ACCESS BY INDEX ROWID | T1    |     1 |    36 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - UPD$1
   2 - UPD$1        / T2@UPD$1
   3 - SEL$F5BB74E1
   6 - SEL$F5BB74E1 / T1A@SEL$2
   7 - SEL$F5BB74E1 / T1A@SEL$2
   8 - SEL$F5BB74E1 / T1B@SEL$2
   9 - SEL$F5BB74E1 / T1B@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access("T1A"."CONTRACT"=:B1 AND "T1A"."DATE_FROM"=:B2 AND
              "T1A"."DATE_TO"=:B3)
   8 - access("T1B"."CONTRACT"=:B1 AND "T1B"."DATE_TO">:B2 AND
              "T1B"."DATE_FROM"<=:B3)
       filter("T1B"."DATE_TO">:B1 AND "T1B"."CONTRACT"="T1A"."CONTRACT" AND
              "T1B"."DATE_FROM"<="T1A"."DATE_FROM" AND "T1B"."DATE_TO">"T1A"."DATE_FROM")
   9 - filter("T1B"."CREATED"<"T1A"."REPLACED" AND 
              "T1B"."REPLACED">"T1A"."CREATED")

Notice that there’s nothing in the body of the plan that tells you which copy of t1 is visited first – you can’t tell unless you look carefully at the predicate information, or unless you’ve requested the alias section that lets you see very easily that the t1 at line 6 (the first one accessed) is aliased as t1a and the t1 at line 9 is aliased as t1b. While you’re at it, check the plan hash value from the top of the plan output.

Now the plan when we hint the two tables into the reverse order:

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------
Plan hash value: 2328412027

-----------------------------------------------------------------------------------------
| Id  | Operation                       | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                |       |   100K|  2734K|   600K (17)| 00:50:01 |
|   1 |  UPDATE                         | T2    |       |       |            |          |
|   2 |   TABLE ACCESS FULL             | T2    |   100K|  2734K|    64   (8)| 00:00:01 |
|   3 |   SORT AGGREGATE                |       |     1 |    72 |            |          |
|   4 |    NESTED LOOPS                 |       |       |       |            |          |
|   5 |     NESTED LOOPS                |       |     1 |    72 |     5   (0)| 00:00:01 |
|   6 |      TABLE ACCESS BY INDEX ROWID| T1    |     1 |    36 |     3   (0)| 00:00:01 |
|*  7 |       INDEX RANGE SCAN          | T1_I1 |     1 |       |     2   (0)| 00:00:01 |
|*  8 |      INDEX RANGE SCAN           | T1_I1 |     1 |       |     1   (0)| 00:00:01 |
|*  9 |     TABLE ACCESS BY INDEX ROWID | T1    |     1 |    36 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - UPD$1
   2 - UPD$1        / T2@UPD$1
   3 - SEL$F5BB74E1
   6 - SEL$F5BB74E1 / T1B@SEL$2
   7 - SEL$F5BB74E1 / T1B@SEL$2
   8 - SEL$F5BB74E1 / T1A@SEL$2
   9 - SEL$F5BB74E1 / T1A@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access("T1B"."CONTRACT"=:B1 AND "T1B"."DATE_TO">:B2 AND
              "T1B"."DATE_FROM"<=:B3) 
       filter("T1B"."DATE_TO">:B1)
   8 - access("T1B"."CONTRACT"="T1A"."CONTRACT" AND "T1A"."DATE_FROM"=:B1 AND
              "T1A"."DATE_TO"=:B2)
       filter("T1A"."CONTRACT"=:B1 AND "T1B"."DATE_FROM"<="T1A"."DATE_FROM" AND 
              "T1B"."DATE_TO">"T1A"."DATE_FROM")
   9 - filter("T1B"."CREATED"<"T1A"."REPLACED" AND 
              "T1B"."REPLACED">"T1A"."CREATED")

The plan hash value is the same – but if you look at the alias section you can see that the t1 we access first is the one with alias t1b.
Now compare the predicate sections, just for line 7 (the initial index access line):

First Plan
   7 - access("T1A"."CONTRACT"=:B1 AND "T1A"."DATE_FROM"=:B2 AND "T1A"."DATE_TO"=:B3)

Second Plan
   7 - access("T1B"."CONTRACT"=:B1 AND "T1B"."DATE_TO">:B2 AND "T1B"."DATE_FROM"<=:B3) 
       filter("T1B"."DATE_TO">:B1)

Although my model data is random garbage, it’s easy to imagine that with a large data set the selectivities of these two predicates could be dramatically different, and there is clearly some “real-world” meaning to the date_to/date_from columns for a given row that the optimizer is unlikely to recognise when looking at the individual column stats for the table. It’s not surprising that a plan with no stats (which results in dynamic sampling) could find a better plan than a table with stats that leaves the optimizer using its default “call it 5% and hope for the best” strategy for range-based joins.

Conclusion

When you reference a table more than once in an execution plan, make sure you look very carefully at the predicate section – or even call for the alias section – so that you know exactly which copy of the table appears at which point in the plan.

Footnote

Although the results don’t mean much for my example, here’s my output from tracing my queries – rather than report the whole query in each case, I’ve just given the hint to show the table order:


/*+ leading(v1.t1a v1.t1b) */

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      6.01       0.00          0     200722     204119      100000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      6.01       0.00          0     200722     204119      100000

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  UPDATE  T2 (cr=200773 pr=0 pw=0 time=0 us)
    100000     100000     100000   TABLE ACCESS FULL T2 (cr=459 pr=0 pw=0 time=0 us cost=64 size=2800000 card=100000)
    100000     100000     100000   SORT AGGREGATE (cr=200248 pr=0 pw=0 time=0 us)
         0          0          0    NESTED LOOPS  (cr=200248 pr=0 pw=0 time=0 us)
         0          0          0     NESTED LOOPS  (cr=200248 pr=0 pw=0 time=0 us cost=4 size=72 card=1)
         0          0          0      TABLE ACCESS BY INDEX ROWID T1 (cr=200248 pr=0 pw=0 time=0 us cost=2 size=36 card=1)
         0          0          0       INDEX RANGE SCAN T1_I1 (cr=200248 pr=0 pw=0 time=0 us cost=1 size=0 card=1)(object id 81082)
         0          0          0      INDEX RANGE SCAN T1_I1 (cr=0 pr=0 pw=0 time=0 us cost=1 size=0 card=1)(object id 81082)
         0          0          0     TABLE ACCESS BY INDEX ROWID T1 (cr=0 pr=0 pw=0 time=0 us cost=2 size=36 card=1)

/*+ leading(v1.t1b v1.t1a) */

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      6.09       0.00          0     613372     200000      100000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      6.09       0.00          0     613372     200000      100000

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  UPDATE  T2 (cr=613372 pr=0 pw=0 time=1 us)
    100000     100000     100000   TABLE ACCESS FULL T2 (cr=459 pr=0 pw=0 time=0 us cost=64 size=2800000 card=100000)
    100000     100000     100000   SORT AGGREGATE (cr=612913 pr=0 pw=0 time=1 us)
         0          0          0    NESTED LOOPS  (cr=612913 pr=0 pw=0 time=1 us)
         0          0          0     NESTED LOOPS  (cr=612913 pr=0 pw=0 time=1 us cost=5 size=72 card=1)
    166078     166078     166078      TABLE ACCESS BY INDEX ROWID T1 (cr=368051 pr=0 pw=0 time=0 us cost=3 size=36 card=1)
    166078     166078     166078       INDEX RANGE SCAN T1_I1 (cr=202108 pr=0 pw=0 time=0 us cost=2 size=0 card=1)(object id 81082)
         0          0          0      INDEX RANGE SCAN T1_I1 (cr=244862 pr=0 pw=0 time=0 us cost=1 size=0 card=1)(object id 81082)
         0          0          0     TABLE ACCESS BY INDEX ROWID T1 (cr=0 pr=0 pw=0 time=0 us cost=2 size=36 card=1)

There isn’t a lot of difference in the run-time (which is mostly due to the tablescan) – but there’s an obvious difference in the number of buffer visits and the amount of data found for the first t1 access.
Just like the OP my plan varied with stats – though in my case I got the better plan when I had stats, and the worse plan when I deleted stats and the optimizer used dynamic sampling.

Database Thin Cloning: Summary

MH900401300

This post is part of an ongoing series:

Summary

These blog posts on thin cloning have covered a number of ways to create database thin clones, yet there are still some challenges:

Database-managed Cloning with Oracle Clonedb

  • Performance issues
  • Requires dNFS
  • Oracle 11.2.0.2 and higher only
  • No possibility of branching clones
  • No practical way to share datafiles from clones of the source DB at different points in time

Copy on Write with EMC COFW

  • Can cause write overhead on source filesystems
  • No branching of clones
  • Difficulty in maintaining efficient changes on secondary arrays
  • Limit of 16 snapshots per LUN

Redirect on Write with EMC VNX ROW

  • Limit of 256 snapshots per LUN
  • Difficulty maintaining efficient storage of changes from source database on secondary storage array

Write Anywhere File System with NetApp

  • Limit of 255 snapshots per file or LUN
  • Support provided for maintaining source database changes efficiently on a second array; however, they require a number of products and implementation overhead
  • Size limited to 16-100TB depending on the model of array

Allocate on Write with ZFS

  • Unlimited and instantaneous snapshots with no space overhead
  • Possibility of efficiently sending changes from a source array to a secondary array but no documentation exists for these methods

MH900157023

 

Each of these technologies faces different challenges. One obvious constraint is that any solution that depends on a vendor such as EMC, NetApp, or Oracle will tie the solution only to that storage solution. Some concepts such as Clonedb and Illumos ZFS can be run on any storage but have other constraints.

One of the consistent challenges across all of these technologies is the expert knowledge and extensive manual configuration that can impede implementation of these technologies to provide database thin clones. If database thin clones reduce storage so dramatically as well as reduce cloning times drastically then the thin clone technology should have taken off across multiple industries. This technology involving filesystem snapshots has existed since the mid 1990s—almost a decade and a half ago. In 1994, StorageTek introduced the virtual disk in their Iceberg release. In 1995 Iceberg started supporting filesystem snapshots. However after 15 years, database thin cloning is still rarely used.

The answer to this riddle lies in an analogy. The analogy is the Internet: it was around years before the browser was ever used. Before browsers one could do many things that they can do today on the Internet: use email, transfer files (via FTP), go to chat rooms, and use bulletin boards. But until the browser was created most activity on the Internet was from academics. It wasn’t until the introduction of the browser that usage of the Internet exploded. In a similar way it wasn’t until the introduction of Database Virtualization that usage of database thin cloning began to explode.

Stay tuned for new series on Data Virtualization for Databases, i.e. Database Virtualization.
Upcoming blog posts will discuss
  • OEM 12c Database as a Service (DBaaS) for Netapp
  • Snapshot Management Utility (SMU) for ZFS storage Appliance
  • Delphix for any storage even JBODs