Concurrent Index Creation

When I read the recent post by the optimizer group about the new concurrent statistics gathering feature added in 11.2.0.2, it reminded me that I had intended to publish something based on the same idea for some time.

The Problem

The work was motivated by a client's recurring need, during a transition phase from non-Exadata to Exadata, to create literally thousands of indexes - potentially with a multitude of (sub-)partitions - as fast as possible, as part of a full Data Pump import of a multi-terabyte database running 11.1.0.7 and 11.2.0.1 (Exadata V2).

There are actually two issues regarding the index creation part of a large database import:

1. The Data Pump import performs the index creation with only a single worker thread, even when using the PARALLEL worker thread import feature. Although each individual index could be created in parallel, if you have thousands of smaller index objects this single worker thread potentially does not make efficient use of the available hardware resources on high-end configurations, in particular Exadata.

2. There is a nasty bug 8604502, introduced with 11.1.0.7, that also affects 11.2.0.1 (fixed in 11.2.0.2; a generic one-off patch is available on My Oracle Support for 11.1.0.7 and 11.2.0.1): IMPDP creates all indexes serially, even those supposed to be created in parallel, and only ALTERs them to the defined PARALLEL degree after the creation. Note that the fix only addresses the problem at actual execution time; even with the fix installed (and in 11.2.0.2) the SQLFILE option of IMPDP still generates CREATE INDEX DDLs that always have the parallel degree set to PARALLEL 1 (see MOS document 1289032.1 and bug 10408313 - INDEXES ARE CREATED WITH PARALLEL DEGREE 1 DURING IMPORT, which has been closed as not being a bug). This "not-being-a-bug" also affects all other versions that support the Data Pump utility - the SQLFILE option always generates CREATE INDEX scripts with the parallel degree set to 1, no matter what the actual degree of the index is supposed to be. It is only the ALTER INDEX command following the CREATE INDEX command that sets the parallel degree correctly, as illustrated below.
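
To illustrate the pattern, a hypothetical excerpt of such a SQLFILE could look like this (owner, index and table names are made up for illustration):

CREATE INDEX "APP_OWNER"."ORDERS_IDX1" ON "APP_OWNER"."ORDERS" ("ORDER_DATE")
PCTFREE 10 INITRANS 2 MAXTRANS 255
TABLESPACE "APP_IDX" PARALLEL 1 ;

ALTER INDEX "APP_OWNER"."ORDERS_IDX1" PARALLEL 16 ;

Note how the CREATE INDEX itself carries PARALLEL 1 and only the subsequent ALTER INDEX re-establishes the intended degree - which is exactly what the transformation scripts described below correct.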

In combination these two issues meant that for the client a full database import job took ages to complete the index creation step, even though the vast amount of table data had been loaded quite quickly in parallel.

In the case of partitioned indexes there is another complication, independent of the issues just mentioned: Oracle uses only one parallel slave per partition for the creation - in the case of large and/or few partitions this again doesn't make efficient use of the available resources.

Oracle therefore provides several means to speed up index creation and rebuild tasks, in particular the documented DBMS_PCLXUTIL package that has been around since the Oracle 8 days to overcome the above-mentioned limitation of partitioned index creation by spawning multiple jobs, each rebuilding an index partition in parallel.
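
As a quick illustration, a call could look like the following minimal sketch - table and index names are made up, the parameter names correspond to the documented BUILD_PART_INDEX signature:

-- Sketch: rebuild the partitions of local index FOO_IDX on table FOO
-- using 4 concurrent jobs, each rebuilding its partition with 2 parallel slaves
begin
  dbms_pclxutil.build_part_index(
    jobs_per_batch => 4,
    procs_per_job  => 2,
    tab_name       => 'FOO',
    idx_name       => 'FOO_IDX',
    force_opt      => false
  );
end;
/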

Another, undocumented feature is the DBMS_INDEX_UTL package that is obviously used internally as part of several maintenance operations, for example those DDLs that include the "UPDATE INDEXES" clause. According to its spec it allows rebuilding multiple indexes concurrently by spawning multiple jobs - however, since it is undocumented it might not be safe to use in production-like configurations. Furthermore it might be changed in future releases without further notice and is therefore potentially unreliable.

A Solution

Since the client wanted a quick solution that ideally addressed all of the above issues I came up with a simple implementation that uses Advanced Queueing and background jobs to create as many indexes as desired concurrently.

The solution is targeted towards the client's scenario, so the following is assumed:

- There is a SQL file that contains the CREATE INDEX statements. This can easily be generated via IMPDP based on the dump files using the SQLFILE option.

- To address the CREATE INDEX (not-being-a-)bug (as mentioned above, the bugfix for bug 8604502 still generates incorrect CREATE INDEX DDLs with the SQLFILE option of IMPDP) I've created a combination of "sed" and "awk" Unix scripts that take the IMPDP SQLFILE, potentially including all DDL commands, as input and create an output file that consists solely of the CREATE INDEX commands with correct PARALLEL clauses based on the ALTER INDEX command following the CREATE INDEX in the script.

- To address the lengthy index creation process I've created a small PL/SQL package that sets up the required AQ infrastructure, takes the CREATE INDEX DDL file as input, populates a queue with the index creation commands and spawns as many worker threads as specified that take care of the actual index creation (which in turn might be a parallel index creation).

As a side note it is interesting that Oracle actually allows building several indexes concurrently on the same segment (which makes total sense but probably does not happen too often in practice).

Note that in principle this code could be used as a general template to execute arbitrary DDLs concurrently (of course with corresponding modifications).
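
For reference, the basic Advanced Queueing pattern the package relies on looks roughly like the following minimal sketch - the type, queue and payload names used here are made up for illustration and do not necessarily match the ones in the actual package:

-- Sketch of the enqueue side (illustrative names only)
create or replace type t_ddl_payload as object (
  ddl_text      clob,
  worker_set_id varchar2(30)
);
/

begin
  dbms_aqadm.create_queue_table(
    queue_table        => 'ddl_queue_tab',
    queue_payload_type => 'T_DDL_PAYLOAD');
  dbms_aqadm.create_queue(
    queue_name  => 'ddl_queue',
    queue_table => 'ddl_queue_tab');
  dbms_aqadm.start_queue('ddl_queue');
end;
/

declare
  l_enq_opts  dbms_aq.enqueue_options_t;
  l_msg_props dbms_aq.message_properties_t;
  l_msgid     raw(16);
begin
  -- one message per DDL command to execute
  dbms_aq.enqueue(
    queue_name         => 'ddl_queue',
    enqueue_options    => l_enq_opts,
    message_properties => l_msg_props,
    payload            => t_ddl_payload('create index app_owner.orders_idx1 on app_owner.orders(order_date) parallel 4', 'PARALLEL_INDEX'),
    msgid              => l_msgid);
  commit;
end;
/

The worker jobs spawned via DBMS_SCHEDULER then simply dequeue messages in a loop and execute each DDL, which is essentially what the provided package automates.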

The following link allows you to download an archive that contains the following subdirectories:

- correct_parallel_clause: This directory contains the Unix scripts mentioned above that allow you to process a SQLFILE generated by IMPDP and output a DDL file that consists solely of the CREATE INDEX commands contained in the SQLFILE. The generated CREATE INDEX statements also use a correct PARALLEL clause - the degree is taken from the ALTER INDEX DDL command following the CREATE INDEX in the SQLFILE. For further details refer to the README.txt in that directory. Note that the script at present does not handle domain indexes, only conventional and bitmap indexes.

- source: Contains the package source for the concurrent index creation, furthermore a package that is required by the provided automated unit testing (see below for more details) and a script that prompts for the required details to initiate a concurrent index creation. The README.txt in that directory provides a quick start guide on how to use the concurrent index creation.

- test: Contains two flavours of test harnesses for automated unit testing of the package: one based on the unit testing feature implemented in SQLDeveloper 2.1.1, and another one based on "dbunit", an open-source unit testing framework based on jUnit. The README.txt files in the respective subdirectories explain how to use these unit tests.

How to use it

The usage is split into two parts: the first part deals with preparing a suitable text file that consists of the CREATE INDEX commands; the second part is about processing this text file with as many worker threads as desired.

Preparing the file is straightforward: You can use the "transform_all_sql.sh" script to generate the required CREATE INDEX script from a DDL script created via IMPDP SQLFILE.

The script has been tested primarily with bash, sed and awk under Cygwin 1.7.1 and OEL5, different Unix flavors might have different versions of the shell, awk or sed and therefore might behave differently.

Simply put all four Unix scripts in the "correct_parallel_clause" directory into the same directory, mark them as executable and run "transform_all_sql.sh" like this:

./transform_all_sql.sh < input_file > output_file

where "input_file" is the file generated via IMPDP SQLFILE option and "output_file" will be the result.

In order to perform the parallel index creation you need an account with suitable privileges. Since it is assumed that the indexes will have to be created in different schemas, this account needs extended privileges. The package is implemented using invoker's rights, so granting these privileges via roles is sufficient. A quick and dirty solution could be to create a temporary account and simply grant it the DBA role (this is what I used to test it). Note that the account also requires EXECUTE privileges on the DBMS_AQ and DBMS_AQADM packages for the AQ stuff. It also needs a simple logging table where errors and progress are written to, as well as a type that is used as the payload of the queue. Obviously the account also needs to be able to create jobs - in this version of the package this is done via DBMS_SCHEDULER. At execution time the package is going to create a queue plus a queue table that needs to be stored in a tablespace - so you should make sure that the account (or at least the database) that executes the index creation has an appropriate default tablespace defined.
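
A quick and dirty setup along those lines could look like the following sketch (user name, password and tablespaces are made up - adapt to your environment):

-- Sketch only: a throwaway account with broad privileges for the import phase
create user conc_idx identified by conc_idx
default tablespace users temporary tablespace temp;

grant dba to conc_idx;
grant execute on dbms_aq to conc_idx;
grant execute on dbms_aqadm to conc_idx;
-- CREATE JOB is normally covered by the DBA role, granted explicitly here for clarity
grant create job to conc_idx;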

You can simply run the "pk_create_index_concurrent.sql" script (located in the "source" directory) in such a suitable account which will deinstall/install all required objects.

The execution of the index creation is then straightforward (taken from the package specification):

/**
* The main entry point to create indexes via parallel threads / AQ
* @param p_directory_name The directory where the file resides that contains the CREATE INDEX DDLs
* @param p_file_name The file name in the directory above
* @param p_parallel_degree_set_1 The number of threads to start for the worker thread set 1 which usually
represents the SERIAL_INDEX threads - G_AUTO_PARALLEL_DEGREE means use the CPU_COUNT and
CLUSTER_DATABASE_INSTANCES parameter to determine number of threads automatically
* @param p_parallel_degree_set_2 The number of threads to start for the worker thread set 2 which usually
represents the PARALLEL_INDEX threads - G_AUTO_PARALLEL_DEGREE means get the CPU_COUNT and
CLUSTER_DATABASE_INSTANCES parameter to determine number of threads automatically,
however 1 is the default here since we assume that these indexes use parallel DDL
* @param p_job_submit_delay The number of seconds each job will be delayed to allow Oracle
proper load balancing in a cluster, default 30 seconds (commented out at present due to
odd locking issues on the queue table in RAC environments)
* @param p_sleep_seconds The number of seconds to wait for the threads to startup
before attempting to teardown the AQ infrastructure again
* @param p_optional_init Optionally a SQL can be passed usually used to initialize the session
for example forcing a particular parallel degree
* @param p_worker_set_id_1
The character identifier used to identify the indexes to process by the first worker thread set
Default value is "SERIAL_INDEX"
* @param p_worker_set_id_2
The character identifier used to identify the indexes to process by the second worker thread set
Default value is "PARALLEL_INDEX"
**/
procedure create_index_concurrent(
p_directory_name in varchar2
, p_file_name in varchar2
, p_parallel_degree_set_1 in integer default G_AUTO_PARALLEL_DEGREE
, p_parallel_degree_set_2 in integer default 1
, p_job_submit_delay in integer default 30
, p_sleep_seconds in integer default 10
, p_optional_init in varchar2 default null
, p_worker_set_id_1 in varchar2 default G_WORKER_SET_ID_1
, p_worker_set_id_2 in varchar2 default G_WORKER_SET_ID_2
);

Note that the "p_job_submit_delay" parameter is currently not used - there were some odd locking issues on the AQ table in case of a RAC environment when using that option so I have commented out its usage at present - I haven't had a chance yet to investigate further what the problem actually was.

So the only required input to the CREATE_INDEX_CONCURRENT procedure is the name of the directory object that points to the directory where the file to process resides and the name of the file itself.

You probably want to specify the number of worker threads for the two sets: The idea here is to distinguish between the creation of serial and parallel indexes. The first parameter specifies the number of worker threads used for serial indexes, the second one the number of concurrent threads for parallel indexes.

The default is CPU_COUNT * INSTANCES threads for serial indexes and a single thread for parallel indexes.

If you don't want/need this separation of serial and parallel indexes, simply use the same "worker_set_id" for both parameters "p_worker_set_id_1" and "p_worker_set_id_2", specify the desired total parallel degree in one of the degree parameters and set the other one to 0 (the 0 is required, otherwise one of the DBMS_SCHEDULER.CREATE_JOB calls will fail with a "duplicate job name/job name already exists" error).

The "p_sleep_seconds" parameter is only used to allow the jobs spawned to put a lock on the queue table - the teardown is then going to wait until all locks have been removed and therefore all queue processing has ended. The default of 10 seconds was sufficient in all cases I've encountered.

Since the package requires as a prerequisite a directory object pointing to where the file to process resides, I've prepared the script "create_index_concurrent.sql" that guides you through the required inputs and takes care of that step as well.

It takes the full O/S path to the file and the file name as input, creates a directory CREATE_INDEX_CONCURRENT_DIR pointing to that directory and then prompts for the two degrees and the names of the two worker thread sets before calling the CREATE_INDEX_CONCURRENT stored procedure.
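
If you prefer to call the procedure directly instead of using the script, a call could look roughly like this - the directory object, file name and degrees are made-up example values, and the package name is assumed to correspond to the installation script mentioned above:

begin
  pk_create_index_concurrent.create_index_concurrent(
    p_directory_name        => 'CREATE_INDEX_CONCURRENT_DIR'
  , p_file_name             => 'create_index.sql'
  , p_parallel_degree_set_1 => 16  -- worker threads for the serial indexes
  , p_parallel_degree_set_2 => 2   -- worker threads for the parallel indexes
  );
end;
/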

Caveats

Please note that you should double-check not to pass a non-transformed SQLFILE generated via IMPDP to the procedure - the results may be dire since the generated SQLFILE always contains much more than the bare CREATE INDEX commands, no matter what options you use for IMPDP. Always use the provided Unix scripts to post-process the SQLFILE before initiating the index creation.

Furthermore you need to be aware of a current limitation of the package: it does not attempt to tokenize the file contents. It simply uses the semicolon as a delimiter to separate the DDL commands. This should be sufficient for most cases, but if you have a function-based index whose definition contains a string expression with a semicolon, this will not work as expected. Also, if you plan to use this package for other DDL execution activities like CTAS statements you might again hit this limitation if the DDL text contains semicolons.

Note that creating indexes using this tool potentially results in different index statistics than creating them via IMPDP, since IMPDP by default also imports the index statistics, whereas the indexes created using this tool end up with the index statistics automatically generated during index creation (from 10g onwards; the code requires at least 10.2). If you want to have the index statistics imported you can run IMPDP after the index creation using the INCLUDE=INDEX_STATISTICS option. This should complete fairly quickly and will import only the index statistics.

If you have SERVEROUTPUT enabled by default then you will very likely see some errors printed by the initial attempt to tear down the AQ infrastructure. These errors are expected if the previous run completed successfully, or in case of the initial run, and can be ignored (and will be caught/ignored by the default implementation).

Note also that all provided scripts except for the Unix shell scripts use DOS file format - under OEL this isn't a problem but it might be on your platform.

Finally the inevitable disclaimer: Although this has been tested thoroughly it comes with absolutely no warranty. Use it at your own risk and test it in your environment before attempting any runs against anything important.

Monitoring the execution

The code logs errors and progress into the table CREATE_INDEX_CONCURRENT_LOG. At present the code logs every attempt to execute DDL into the table as well as any errors that are raised during that DDL execution.

So the table can be used both for monitoring the progress and for checking for errors. The code currently continues the execution in case of errors encountered, using the dreaded WHEN OTHERS THEN NULL construct, but it is already prepared for more granular error handling if required - see the defined exceptions and the commented-out exception handler.

You can view the queue contents in the corresponding queue view created by the AQ setup (AQ$CREATE_INDEX_QUEUE) in order to see the data to process. Note that because the worker threads do not commit the queue transaction, you won't be able to see the progress in the queue table until all worker threads have committed. If you don't like that, you can remove the wait and the "teardown_aq" call at the end of the main procedure "create_index_concurrent" and uncomment the dequeue option "visibility=immediate" in the "create_index_thread" procedure. You would then need to call "teardown_aq" in a separate step as desired. With this modification you can monitor the progress via the queue, but the provided automated unit testing won't work with that variant since it relies on the main call waiting for all worker threads to complete before validating the results.
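
For a quick look at the overall queue status, a simple query against the queue view (using the view name mentioned above) might look like this - MSG_STATE is a standard column of AQ$ views:

-- Sketch: number of messages per state (READY, PROCESSED, ...)
select
msg_state
, count(*)
from
aq$create_index_queue
group by
msg_state;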

However, you can also follow the progress in the log table using the following sample query:

select
to_char(log_timestamp, 'DD-MON-YYYY HH24:MI:SS.FF') as log_timestamp
, sql_statement
, message
from
create_index_concurrent_log
order by
log_timestamp desc;

If you want to perform more sophisticated queries on that table you might need to use casts similar to the following, because the text columns are defined as CLOBs in order to be able to hold the complete DDLs and error messages in case of errors. The casts allow you to perform, for example, GROUP BYs etc.

select
to_char(log_timestamp, 'DD-MON-YYYY HH24:MI:SS.FF') as log_timestamp
, cast(substr(sql_statement, 1, 30) as varchar2(30)) as index_name
, cast(substr(message, 1, 128) as varchar2(128)) as worker_set_id
from
create_index_concurrent_log
order by
log_timestamp desc;

The Unit Testing

Here we come to a completely different issue that is off-topic for this post; however, in my experience so far it is a very important one, and I hope to find the time to cover it in separate posts in the future.

Generally speaking, I've seen too many shops that don't follow best practice when it comes to database deployment and development, so here is what you ideally should know/do about it - in a nutshell:

- Treat your database like source code, which means put everything related to the database under version control. This includes not only the obvious database source code but also DDL and DML scripts for schema evolution
- Use unit testing to test database code. Automate this unit testing
- Automate the deployment of your database related changes
- Install a continuous integration environment that runs the automated deployment and unit tests regularly, for example every night
- Automate deployment everywhere - starting from the development databases up to the production environment
- Follow your guidelines strictly - for example any hotfix-like adhoc change should still go through the established processes - code changes, testing, deployment etc.

I've helped several clients in the past to setup corresponding tools and processes for implementing above - if you are interested, get in touch with me.

So as a bonus, if you haven't spent much time yet on the above-mentioned topics, in order to get you started at least with automated unit testing I've included two different examples for this small piece of source code: one using the built-in unit test feature of SQLDeveloper and the other using "dbunit". You can find both in the corresponding subdirectories of the "test" folder in the archive.

The unit testing is based on the "pk_create_index_concur_test.sql" package that is used to setup and teardown the environment for running the unit test. It assumes at present the existence of a directory "C:\app\oracle\admin\orcl112\dpdump" on O/S level. It will create a directory object for the path and attempt to create/write a file used for the unit test runs. You can pass any valid O/S directory path to the "pk_create_index_concur_test.setup" procedure if you want/need to use a different one.

All provided automated tests assume that both scripts, "pk_create_index_concurrent.sql" and "pk_create_index_concur_test.sql" have been run in the schema that should be used for test runs.

You can use the SQLDeveloper Unit Test feature to run the provided unit test. You can either use the GUI to import and run the test, or you can use a command line version that uses ANT to run the UTUTIL command line tool that comes with SQLDeveloper. Read and follow the instructions in the "README.txt" in the test/SQLDeveloper directory on how to do so. You'll need to set up a unit test repository initially if you want to use SQLDeveloper's unit testing feature either way (GUI or UTUTIL command line). See the SQLDeveloper user's guide or online help for how to do that (hint: the menu item "Extras->Unit Testing" gets you started).

If you don't like the SQLDeveloper unit test approach, or you are simply too lazy to install the tool, the unit test repository etc., you can alternatively try the automated unit testing using "dbunit". Follow the instructions in the "README.txt" in the test/dbunit directory on how to run the unit tests using "dbunit".

This version of the package has successfully been tested using these unit tests on 10.2.0.4, 10.2.0.5, 11.1.0.7, 11.2.0.1 and 11.2.0.2 (after all it's dead easy with automated unit testing :-).

Summary

The provided tool set should represent a solid foundation for the given task of concurrent index creation. In particular it has been designed with the following in mind:

- Efficient use of privileges granted via roles: The package uses invoker's rights and most operations use dynamic SQL to avoid compilation issues, therefore granting the required privileges to the account used via roles should be sufficient

- The Unix scripts should be able to deal with table-, schema- and database-level datapump formats from Oracle 10g and 11g (all these variants use slightly different texts to identify the relevant sections of the generated SQLFILE by IMPDP)

- Optional use of two separate worker thread sets: This allows the concurrent creation of a multitude of indexes, be it serial or parallel, with clear distinction between the handling of serial (possibly many worker threads) and parallel indexes (usually only a few worker threads)

- Support for arbitrarily sized SQL: The DDL commands for (sub-)partitioned indexes can become quite large due to the way the Oracle metadata API generates the SQL, so these generated SQLs can easily exceed the usual 32KB limit for PL/SQL character strings. The implementation uses CLOBs for the processed SQLs (and DBMS_SQL in versions lower than 11 to handle these piecewise) to support these potentially very large SQLs - see the sketch after this list

- RAC/Grid/Cluster support via DBMS_SCHEDULER: The usage of DBMS_SCHEDULER allows fine-grained control of the resource consumption through the optional use of job classes (not implemented yet but easily added - it is a simple additional parameter to the CREATE_JOB procedure) that allow specifying a resource consumer group and a specific service name for the spawned worker threads

- Automated Unit Testing support: The provided unit test harness allows for easy testing of modifications to the code
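
To illustrate the point about arbitrarily sized SQL, here is a minimal sketch of how a CLOB-held DDL can be executed piecewise via DBMS_SQL in pre-11g versions (the demo DDL is deliberately tiny and just stands in for a statement that may exceed 32KB; from 11g onwards native dynamic SQL accepts the CLOB directly):

-- Sketch: execute a potentially very large DDL held in a CLOB
declare
  l_ddl    clob := 'create table ddl_exec_demo (id number)'; -- stands in for a very large DDL
  l_cur    integer;
  l_stmt   dbms_sql.varchar2a;
  l_chunk  constant pls_integer := 32000;
  l_pieces pls_integer;
begin
  -- split the CLOB into a VARCHAR2A collection and parse it piecewise;
  -- note that a DDL is executed as part of the DBMS_SQL.PARSE call
  l_pieces := ceil(dbms_lob.getlength(l_ddl) / l_chunk);
  for i in 1 .. l_pieces loop
    l_stmt(i) := dbms_lob.substr(l_ddl, l_chunk, (i - 1) * l_chunk + 1);
  end loop;
  l_cur := dbms_sql.open_cursor;
  dbms_sql.parse(l_cur, l_stmt, 1, l_pieces, false, dbms_sql.native);
  dbms_sql.close_cursor(l_cur);
  -- from 11g on the following single line would do instead:
  -- execute immediate l_ddl;
end;
/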

Pending Statistics

This is just a quick heads-up to those that plan to use the Pending Statistics feature that has been introduced in Oracle 11.1.

It looks like in all currently available versions that support this feature, Pending Statistics have not been implemented consistently for all possible DBMS_STATS calls, so you have to be very careful which calls you use. Having enabled pending statistics for a particular table, you might start to manipulate the statistics under the impression that the modifications performed are not reflected in the actual dictionary statistics (by "dictionary statistics" in this case I don't mean the statistics of the data dictionary objects themselves but the actual statistics of database objects stored in the data dictionary) but only in the pending statistics area, allowing you to test statistics modifications in an isolated environment using the OPTIMIZER_USE_PENDING_STATISTICS parameter at session level.
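
As a reminder, the intended workflow looks roughly like this minimal sketch (table T is the one used in the demo script further below):

-- Sketch: intended use of pending statistics
exec dbms_stats.set_table_prefs(null, 'T', 'PUBLISH', 'FALSE')

-- Gather/manipulate statistics - supposed to end up in the pending area only
exec dbms_stats.gather_table_stats(null, 'T')

-- Test the effect in the current session only
alter session set optimizer_use_pending_statistics = true;
-- ... run the statements to be tested ...
alter session set optimizer_use_pending_statistics = false;

-- Publish if happy with the outcome, otherwise discard
exec dbms_stats.publish_pending_stats(null, 'T')
-- exec dbms_stats.delete_pending_stats(null, 'T')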

You therefore might be in for a surprise to find out that this holds true only for a limited set of DBMS_STATS calls, but not for all.

This effectively means that particular changes to the statistics will be effective immediately although pending statistics have been enabled.

In particular manipulations of the statistics using the SET_*_STATS procedures of DBMS_STATS seem to ignore the pending statistics settings and still update the dictionary statistics immediately without further notice.

This is rather unfortunate since it means that Pending Statistics cannot be used in a straightforward way to test user-defined statistics, which can be very helpful under certain circumstances but require extensive testing before using them on a live system.

But other calls, too, like gathering statistics only for a particular set of columns, show unexpected behaviour: it looks like both sets of statistics get modified - the pending statistics area, but also the dictionary statistics.

Note that setting the GLOBAL preference (DBMS_STATS.SET_GLOBAL_PREFS) for PUBLISH to FALSE seems to fix this particular issue - in that case only the pending statistics get updated and the dictionary statistics are left unchanged. Unfortunately this fix does not apply to the SET_*_STATS procedures; those seem to always update the dictionary statistics.

The worst thing however is that the statistics history that is automatically maintained since Oracle 10g does not reflect these (unintended) changes properly, so you can not easily recover from the potentially unwanted modifications by calling DBMS_STATS.RESTORE_TABLE_STATS.

Finally I was obviously able to activate the pending statistics using DBMS_STATS.RESTORE_TABLE_STATS - you can rather clearly see this behaviour when using the SET_GLOBAL_PREFS('PUBLISH', 'FALSE') variant of the following script.

The following is a small demonstration of the issues encountered - please note that it modifies the GLOBAL preferences for the PUBLISH setting if you intend to run this test by yourself.

set echo on timing on linesize 130 tab off trimspool on

alter session set nls_date_format = 'DD.MM.YYYY HH24:MI:SS';

drop table t purge;

create table t
as
select * from all_objects
where rownum <= 1000;

exec dbms_stats.set_global_prefs('PUBLISH', 'TRUE')

select dbms_stats.get_prefs('PUBLISH', null, 'T') from dual;

-- Our baseline, no histograms, basic column statistics for all columns
exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR ALL COLUMNS SIZE 1')

-- Verify the result
select
num_distinct
, density
, last_analyzed
, num_buckets
, user_stats
from
user_tab_col_statistics
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- Enable pending statistics for table T
-- You can try these different calls
--
-- The GATHER_*_STATS procedures seem to behave correctly
-- only when setting the GLOBAL PREFS to FALSE
--
-- "Correctly" means that the results are reflected in the
-- pending area only but not in the dictionary statistics
--
-- Note that the SET_*_STATS procedures seem to ignore the setting
-- always and publish directly to the dictionary
-- no matter what the PUBLISH setting is on any level (TABLE, GLOBAL)
--
-- exec dbms_stats.set_global_prefs('PUBLISH', 'FALSE')
exec dbms_stats.set_table_prefs(null, 'T', 'PUBLISH', 'FALSE')
-- exec dbms_stats.set_schema_prefs(user, 'PUBLISH', 'FALSE')

-- Verify the current setting, statistics will not be published
select dbms_stats.get_prefs('PUBLISH', null, 'T') from dual;

-- Wait for two seconds to make the LAST_ANALYZED column meaningful
exec dbms_lock.sleep(2)

-- This is supposed to go to the pending statistics area
exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR ALL COLUMNS SIZE 254')

-- Yes, it worked, the dictionary statistics are not modified
select
num_distinct
, density
, last_analyzed
, num_buckets
, user_stats
from
user_tab_col_statistics
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- The pending statistics area contains now the new statistics including histograms
select
num_distinct
, density
, last_analyzed
from
user_col_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histgrm_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histograms
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- Wait for two seconds to make the LAST_ANALYZED column meaningful
exec dbms_lock.sleep(2)

-- Let's gather statistics only for the OBJECT_NAME column
exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR COLUMNS OBJECT_NAME SIZE 1')

-- Oops, why do my dictionary statistics reflect that change (Note the LAST_ANALYZED column)
-- Except for you set the GLOBAL preferences for PUBLISH to FALSE
select
num_distinct
, density
, last_analyzed
, num_buckets
, user_stats
from
user_tab_col_statistics
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- I do have now the statistics updated in both, pending statistics and dictionary statistics
-- Except for you set the GLOBAL preferences for PUBLISH to FALSE
select
num_distinct
, density
, last_analyzed
from
user_col_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histgrm_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histograms
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- Wait for two seconds to make the LAST_ANALYZED column meaningful
exec dbms_lock.sleep(2)

-- Let's recreate the histogram only on the OBJECT_NAME column
exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR COLUMNS OBJECT_NAME SIZE 254')

-- Oops, I did it again...
-- Except for you set the GLOBAL preferences for PUBLISH to FALSE
select
num_distinct
, density
, last_analyzed
, num_buckets
, user_stats
from
user_tab_col_statistics
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
num_distinct
, density
, last_analyzed
from
user_col_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histgrm_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histograms
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- Wait for two seconds to make the LAST_ANALYZED column meaningful
exec dbms_lock.sleep(2)

-- Let's define a manually crafted NDV and DENSITY value
-- Again I expect this to go to the pending statistics area
declare
srec dbms_stats.statrec;
novals dbms_stats.numarray;
distcnt number;
avgclen number;
nullcnt number;
density number;
begin
dbms_stats.get_column_stats(null, 't', 'object_name', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
dbms_stats.set_column_stats(
ownname=>null,
tabname=>'t',
colname=>'object_name',
distcnt=>distcnt*100,
nullcnt=>nullcnt,
srec=>srec,
avgclen=>avgclen,
density=>density/100
);
end;
/

-- Nope, no change here
select
num_distinct
, density
, last_analyzed
from
user_col_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- But I just changed it in the dictionary statistics
-- Even in case of setting the GLOBAL preference to FALSE
select
num_distinct
, density
, last_analyzed
, num_buckets
, user_stats
from
user_tab_col_statistics
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

-- And what is even worse: The statistics history does not reflect all these changes to the dictionary statistics
select table_name, stats_update_time from USER_TAB_STATS_HISTORY where table_name = 'T';

exec dbms_stats.restore_table_stats(null, 'T', systimestamp)

-- But which statistics have been restored now?
-- It looks like this actually restored the PENDING statistics
-- according to the LAST_ANALYZED information??
select
num_distinct
, density
, last_analyzed
, num_buckets
, user_stats
from
user_tab_col_statistics
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histgrm_pending_stats
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

select
count(*)
from
user_tab_histograms
where
table_name = 'T'
and column_name = 'OBJECT_NAME';

exec dbms_stats.set_global_prefs('PUBLISH', 'TRUE')

And this is what I get from running this on 11.1.0.7, 11.2.0.1 or 11.2.0.2:

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> alter session set nls_date_format = 'DD.MM.YYYY HH24:MI:SS';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> drop table t purge;

Table dropped.

Elapsed: 00:00:00.20
SQL>
SQL> create table t
2 as
3 select * from all_objects
4 where rownum <= 1000;

Table created.

Elapsed: 00:00:00.35
SQL>
SQL> exec dbms_stats.set_global_prefs('PUBLISH', 'TRUE')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> select dbms_stats.get_prefs('PUBLISH', null, 'T') from dual;

DBMS_STATS.GET_PREFS('PUBLISH',NULL,'T')
----------------------------------------------------------------------------------------------------------------------------------
TRUE

Elapsed: 00:00:00.00
SQL>
SQL> -- Our baseline, no histograms, basic column statistics for all columns
SQL> exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR ALL COLUMNS SIZE 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.09
SQL>
SQL> -- Verify the result
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 , num_buckets
6 , user_stats
7 from
8 user_tab_col_statistics
9 where
10 table_name = 'T'
11 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED NUM_BUCKETS USE
------------ ---------- ------------------- ----------- ---
995 .001005025 18.01.2011 18:58:24 1 NO

Elapsed: 00:00:00.01
SQL>
SQL> -- Enable pending statistics for table T
SQL> -- You can try these different calls
SQL> --
SQL> -- The GATHER_*_STATS procedures seem to behave correctly
SQL> -- only when setting the GLOBAL PREFS to FALSE
SQL> --
SQL> -- "Correctly" means that the results are reflected in the
SQL> -- pending area only but not in the dictionary statistics
SQL> --
SQL> -- Note that the SET_*_STATS procedures seem to ignore the setting
SQL> -- always and publish directly to the dictionary
SQL> -- no matter what the PUBLISH setting is on any level (TABLE, GLOBAL)
SQL> --
SQL> -- exec dbms_stats.set_global_prefs('PUBLISH', 'FALSE')
SQL> exec dbms_stats.set_table_prefs(null, 'T', 'PUBLISH', 'FALSE')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL> -- exec dbms_stats.set_schema_prefs(user, 'PUBLISH', 'FALSE')
SQL>
SQL> -- Verify the current setting, statistics will not be published
SQL> select dbms_stats.get_prefs('PUBLISH', null, 'T') from dual;

DBMS_STATS.GET_PREFS('PUBLISH',NULL,'T')
----------------------------------------------------------------------------------------------------------------------------------
FALSE

Elapsed: 00:00:00.01
SQL>
SQL> -- Wait for two seconds to make the LAST_ANALYZED column meaningful
SQL> exec dbms_lock.sleep(2)

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.01
SQL>
SQL> -- This is supposed to go to the pending statistics area
SQL> exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR ALL COLUMNS SIZE 254')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.99
SQL>
SQL> -- Yes, it worked, the dictionary statistics are not modified
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 , num_buckets
6 , user_stats
7 from
8 user_tab_col_statistics
9 where
10 table_name = 'T'
11 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED NUM_BUCKETS USE
------------ ---------- ------------------- ----------- ---
995 .001005025 18.01.2011 18:58:24 1 NO

Elapsed: 00:00:00.01
SQL>
SQL> -- The pending statistics area contains now the new statistics including histograms
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 from
6 user_col_pending_stats
7 where
8 table_name = 'T'
9 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED
------------ ---------- -------------------
995 .00101 18.01.2011 18:58:26

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histgrm_pending_stats
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
255

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histograms
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
2

Elapsed: 00:00:00.01
SQL>
SQL> -- Wait for two seconds to make the LAST_ANALYZED column meaningful
SQL> exec dbms_lock.sleep(2)

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.01
SQL>
SQL> -- Let's gather statistics only for the OBJECT_NAME column
SQL> exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR COLUMNS OBJECT_NAME SIZE 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.14
SQL>
SQL> -- Oops, why do my dictionary statistics reflect that change (Note the LAST_ANALYZED column)
SQL> -- Except for you set the GLOBAL preferences for PUBLISH to FALSE
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 , num_buckets
6 , user_stats
7 from
8 user_tab_col_statistics
9 where
10 table_name = 'T'
11 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED NUM_BUCKETS USE
------------ ---------- ------------------- ----------- ---
995 .001005025 18.01.2011 18:58:29 1 NO

Elapsed: 00:00:00.01
SQL>
SQL> -- I do have now the statistics updated in both, pending statistics and dictionary statistics
SQL> -- Except for you set the GLOBAL preferences for PUBLISH to FALSE
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 from
6 user_col_pending_stats
7 where
8 table_name = 'T'
9 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED
------------ ---------- -------------------
995 .001005025 18.01.2011 18:58:29

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histgrm_pending_stats
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
0

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histograms
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
2

Elapsed: 00:00:00.01
SQL>
SQL> -- Wait for two seconds to make the LAST_ANALYZED column meaningful
SQL> exec dbms_lock.sleep(2)

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.01
SQL>
SQL> -- Let's recreate the histogram only on the OBJECT_NAME column
SQL> exec dbms_stats.gather_table_stats(null, 'T', estimate_percent => null, cascade => false, method_opt => 'FOR COLUMNS OBJECT_NAME SIZE 254')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.34
SQL>
SQL> -- Oops, I did it again...
SQL> -- Except for you set the GLOBAL preferences for PUBLISH to FALSE
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 , num_buckets
6 , user_stats
7 from
8 user_tab_col_statistics
9 where
10 table_name = 'T'
11 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED NUM_BUCKETS USE
------------ ---------- ------------------- ----------- ---
995 .00101 18.01.2011 18:58:32 254 NO

Elapsed: 00:00:00.01
SQL>
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 from
6 user_col_pending_stats
7 where
8 table_name = 'T'
9 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED
------------ ---------- -------------------
995 .00101 18.01.2011 18:58:32

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histgrm_pending_stats
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
255

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histograms
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
255

Elapsed: 00:00:00.01
SQL>
SQL> -- Wait for two seconds to make the LAST_ANALYZED column meaningful
SQL> exec dbms_lock.sleep(2)

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.01
SQL>
SQL> -- Let's define a manually crafted NDV and DENSITY value
SQL> -- Again I expect this to go to the pending statistics area
SQL> declare
2 srec dbms_stats.statrec;
3 novals dbms_stats.numarray;
4 distcnt number;
5 avgclen number;
6 nullcnt number;
7 density number;
8 begin
9 dbms_stats.get_column_stats(null, 't', 'object_name', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
10 dbms_stats.set_column_stats(
11 ownname=>null,
12 tabname=>'t',
13 colname=>'object_name',
14 distcnt=>distcnt*100,
15 nullcnt=>nullcnt,
16 srec=>srec,
17 avgclen=>avgclen,
18 density=>density/100
19 );
20 end;
21 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.17
SQL>
SQL> -- Nope, no change here
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 from
6 user_col_pending_stats
7 where
8 table_name = 'T'
9 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED
------------ ---------- -------------------
995 .00101 18.01.2011 18:58:32

Elapsed: 00:00:00.01
SQL>
SQL> -- But I just changed it in the dictionary statistics
SQL> -- Even in case of setting the GLOBAL preference to FALSE
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 , num_buckets
6 , user_stats
7 from
8 user_tab_col_statistics
9 where
10 table_name = 'T'
11 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED NUM_BUCKETS USE
------------ ---------- ------------------- ----------- ---
99500 .0000101 18.01.2011 18:58:34 254 YES

Elapsed: 00:00:00.01
SQL>
SQL> -- And what is even worse: The statistics history does not reflect all these changes to the dictionary statistics
SQL> select table_name, stats_update_time from USER_TAB_STATS_HISTORY where table_name = 'T';

TABLE_NAME STATS_UPDATE_TIME
------------------------------ ---------------------------------------------------------------------------
T 18-JAN-11 06.58.24.391000 PM +01:00

Elapsed: 00:00:00.01
SQL>
SQL> exec dbms_stats.restore_table_stats(null, 'T', systimestamp)

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.29
SQL>
SQL> -- But which statistics have been restored now?
SQL> -- It looks like this actually restored the PENDING statistics
SQL> -- according to the LAST_ANALYZED information??
SQL> select
2 num_distinct
3 , density
4 , last_analyzed
5 , num_buckets
6 , user_stats
7 from
8 user_tab_col_statistics
9 where
10 table_name = 'T'
11 and column_name = 'OBJECT_NAME';

NUM_DISTINCT DENSITY LAST_ANALYZED NUM_BUCKETS USE
------------ ---------- ------------------- ----------- ---
995 .00101 18.01.2011 18:58:32 254 NO

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histgrm_pending_stats
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
255

Elapsed: 00:00:00.01
SQL>
SQL> select
2 count(*)
3 from
4 user_tab_histograms
5 where
6 table_name = 'T'
7 and column_name = 'OBJECT_NAME';

COUNT(*)
----------
255

Elapsed: 00:00:00.01
SQL>
SQL> exec dbms_stats.set_global_prefs('PUBLISH', 'TRUE')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL>

It is also interesting to note that, although Oracle has added an array of new dictionary views that allow access to the pending statistics area, these views are inconsistent with the existing statistics-related views in terms of naming conventions, columns and behaviour.

For example, the *_col_pending_stats view does not have a NUM_BUCKETS column, and the corresponding *_tab_histgrm_pending_stats view for histogram details shows 0 rows if basic column statistics have been defined but no histogram, whereas the original *_tab_histograms view returns two rows representing the low and high value of the column once basic column statistics have been collected.

Summary

The Pending Statistics feature clearly shows some unexpected behaviour. In particular, you'd better not rely on it preventing the dictionary statistics from being updated by DBMS_STATS calls while the PUBLISH attribute is set to FALSE.

Some of this clearly looks like a bug but I couldn't find a corresponding bug entry yet in My Oracle Support.

Note that this post does not cover the actual usage of Pending Statistics by the optimizer - I haven't done any extensive testing in this regard, but some quick checks showed that it seems to work as expected: with OPTIMIZER_USE_PENDING_STATISTICS = TRUE the optimizer picks the pending statistics for those tables that have them defined, but still uses the dictionary statistics for those that don't have any pending statistics defined.

Hash Aggregation

Oracle introduced hash aggregation as a new feature in Oracle 10g. It can be used for both GROUP BY and UNIQUE operations (HASH GROUP BY and HASH UNIQUE respectively) and is by default the preferred aggregation method if there is no particular reason that makes the cost based optimizer prefer sort based aggregation (SORT GROUP BY and SORT UNIQUE), for example when the GROUP BY is followed by an ORDER BY on the same expression (using the SORT GROUP BY in such cases is not always beneficial, by the way - see Guy Harrison's blog for an example).

Ever since its introduction I've heard complaints from time to time about performance degradation of aggregation operations based on the new hash aggregation algorithm compared to the previously used sort based aggregation.

Now there may be many potential reasons for such performance degradations (and possibly many of them might not have anything to do with the hash aggregation) but here is a surprising revelation that might explain why some of them were indeed caused by the switch to the new algorithm: The hash aggregation operation does not work very well together with the automatic PGA management (WORKAREA_SIZE_POLICY = AUTO, default since 9i). The fundamental defect is that it is not able to dynamically resize to a larger workarea size when using automatic PGA management and therefore remains more or less at its initial expected size based on the estimates at optimization time.

This effectively means that the efficiency of the hash aggregation operation under automatic PGA management is heavily dependent on the cardinality estimates at optimization time - with estimates in the right ballpark, the memory used at execution time will correspond to the actual requirements at runtime, but with bad estimates the operation potentially uses far less memory than available and unnecessarily spills to disk.
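
One way to see this effect, besides the Used-Mem/Used-Tmp columns of DBMS_XPLAN.DISPLAY_CURSOR shown below, is to compare the optimizer's expected workarea size with the memory actually used, for example via V$SQL_WORKAREA - a sketch (substitute the SQL_ID of your statement):

-- Sketch: expected vs. actually used workarea size for the aggregation workareas
select
operation_type
, estimated_optimal_size
, last_memory_used
, optimal_executions
, onepass_executions
, multipasses_executions
from
v$sql_workarea
where
sql_id = '&sql_id'
and operation_type like '%GROUP BY%';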

Let's start with a simple script to demonstrate the issue:

set echo on timing on

show parameter pga
show parameter processes

-- alter session set workarea_size_policy = manual sort_area_size = 40000000;

drop table t1 purge;

drop table t2 purge;

create table t1
as
select mod(rownum, 27) as user_id, rownum as object_id from dual connect by level <= 30000;

create table t2
as
select distinct user_id from t1;

exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

column low_val new_value low_value
column high_val new_value high_value

select
max(user_id) + 1 as low_val
, max(user_id) + max(user_id) - min(user_id) + 1 as high_val
from
t2;

-- Invalidate any cursors using T1
comment on table t1 is '';

alter session set statistics_level = all;

-- alter session set tracefile_identifier = 'smm_trace_correct_cardinality';

-- alter session set "_smm_trace" = 65535;

select /*+ full(@inner t1)
full(@inner t2)
full(@inner t3)
leading(@inner t2 t3 t1)
use_nl(@inner t3)
use_hash(@inner t1)
no_swap_join_inputs(@inner t1)
use_hash_aggregation(@inner)
*/
max(cnt)
from
(
select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
from t1, t2, t2 t3
where t1.user_id = t2.user_id
group by t1.object_id, t3.user_id
)
;

-- alter session set "_smm_trace" = 0;

set pagesize 0 linesize 200 trimspool on tab off

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

set pagesize 14

-- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
declare
srec dbms_stats.statrec;
novals dbms_stats.numarray;
distcnt number;
avgclen number;
nullcnt number;
density number;
srec2 dbms_stats.statrec;
begin
dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
srec2.epc := 2;
novals := dbms_stats.numarray(
&low_value,
&high_value
);
srec2.bkvals := null;
dbms_stats.prepare_column_values(srec2,novals);
dbms_stats.set_column_stats(
ownname=>null,
tabname=>'t1',
colname=>'user_id',
distcnt=>distcnt,
nullcnt=>nullcnt,
srec=>srec2,
avgclen=>avgclen,
density=>density
);
end;
/

-- Invalidate any cursors using T1
comment on table t1 is '';

alter session set statistics_level = all;

-- alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';

-- alter session set "_smm_trace" = 65535;

select /*+ full(@inner t1)
full(@inner t2)
full(@inner t3)
leading(@inner t2 t3 t1)
use_nl(@inner t3)
use_hash(@inner t1)
no_swap_join_inputs(@inner t1)
use_hash_aggregation(@inner)
*/
max(cnt)
from
(
select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
from t1, t2, t2 t3
where t1.user_id = t2.user_id
group by t1.object_id, t3.user_id
)
;

-- alter session set "_smm_trace" = 0;

set pagesize 0 linesize 200 trimspool on tab off

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

I first show the current PGA_AGGREGATE_TARGET (P_A_T) and PROCESSES settings.

I then create two tables as sample data that will be joined, and collect information about the minimum and maximum value of the join column right away.

Any cursors that are dependent on table T1 will then be invalidated to make sure that a reoptimization of the next query will take place (of course running the complete script invalidates such cursors anyway due to the drop and re-create of the tables).

I then run a simple join followed by a group by operation. I've added a cartesian join by the way, but its only purpose is to ensure that the generated row source is sufficiently large to give the group by something to do.

I have used hints to ensure that I always get the same execution plan even if I later on manipulate the statistics to see the effect of incorrect cardinality estimates.

After getting the actual runtime execution plan along with execution statistics, I modify the column statistics of the join column of one of the tables in such a way that the optimizer thinks there is no overlap between the join column values and therefore computes a very low join cardinality.

After making sure again that a reoptimization will take place I run the same statement again with the same data volume and the same hints in place.

And this is what I get for all major versions that are currently out there (10.2.0.4, 10.2.0.5, 11.1.0.7, 11.2.0.1 and 11.2.0.2):

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> show parameter pga

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 200M
SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 0
db_writer_processes integer 1
gcs_server_processes integer 0
job_queue_processes integer 10
log_archive_max_processes integer 2
processes integer 150
SQL>
SQL> -- alter session set workarea_size_policy = manual sort_area_size = 40000000;
SQL>
SQL> drop table t1 purge;
drop table t1 purge
*
ERROR at line 1:
ORA-00942: table or view does not exist

Elapsed: 00:00:00.01
SQL>
SQL> drop table t2 purge;
drop table t2 purge
*
ERROR at line 1:
ORA-00942: table or view does not exist

Elapsed: 00:00:00.00
SQL>
SQL> create table t1
2 as
3 select mod(rownum, 27) as user_id, rownum as object_id from dual connect by level <= 30000;

Table created.

Elapsed: 00:00:00.07
SQL>
SQL> create table t2
2 as
3 select distinct user_id from t1;

Table created.

Elapsed: 00:00:00.06
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.09
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> column low_val new_value low_value
SQL> column high_val new_value high_value
SQL>
SQL> select
2 max(user_id) + 1 as low_val
3 , max(user_id) + max(user_id) - min(user_id) + 1 as high_val
4 from
5 t2;

LOW_VAL HIGH_VAL
---------- ----------
27 53

Elapsed: 00:00:00.00
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.04
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> -- alter session set tracefile_identifier = 'smm_trace_correct_cardinality';
SQL>
SQL> -- alter session set "_smm_trace" = 65535;
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:00.71
SQL>
SQL> -- alter session set "_smm_trace" = 0;
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 2s1sdfhvhajv3, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner
t2 t3 t1) use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner
t1) use_hash_aggregation(@inner) */ max(cnt) from ( select /*+ qb_name(inner) */ count(*)
as cnt, t1.object_id, t3.user_id from t1, t2, t2 t3 where t1.user_id = t2.user_id group by t1.object_id,
t3.user_id )

Plan hash value: 3134842094

--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.72 | 139 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.72 | 139 | | | |
| 2 | VIEW | | 1 | 570K| 810K|00:00:00.43 | 139 | | | |
| 3 | HASH GROUP BY | | 1 | 570K| 810K|00:00:00.43 | 139 | 35M| 5521K| 38M (0)|
|* 4 | HASH JOIN | | 1 | 807K| 810K|00:00:00.02 | 139 | 1348K| 1348K| 1093K (0)|
| 5 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | | | |
| 7 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | | | |
| 8 | TABLE ACCESS FULL | T1 | 1 | 29907 | 30000 |00:00:00.01 | 55 | | | |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."USER_ID"="T2"."USER_ID")

29 rows selected.

Elapsed: 00:00:00.12
SQL>
SQL> set pagesize 14
SQL>
SQL> -- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
SQL> declare
2 srec dbms_stats.statrec;
3 novals dbms_stats.numarray;
4 distcnt number;
5 avgclen number;
6 nullcnt number;
7 density number;
8 srec2 dbms_stats.statrec;
9 begin
10 dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
11 srec2.epc := 2;
12 novals := dbms_stats.numarray(
13 &low_value,
14 &high_value
15 );
16 srec2.bkvals := null;
17 dbms_stats.prepare_column_values(srec2,novals);
18 dbms_stats.set_column_stats(
19 ownname=>null,
20 tabname=>'t1',
21 colname=>'user_id',
22 distcnt=>distcnt,
23 nullcnt=>nullcnt,
24 srec=>srec2,
25 avgclen=>avgclen,
26 density=>density
27 );
28 end;
29 /
old 13: &low_value,
new 13: 27,
old 14: &high_value
new 14: 53

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.03
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> -- alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';
SQL>
SQL> -- alter session set "_smm_trace" = 65535;
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:01.37
SQL>
SQL> -- alter session set "_smm_trace" = 0;
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 2s1sdfhvhajv3, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner t1) use_hash_aggregation(@inner)
*/ max(cnt) from ( select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id from t1, t2, t2 t3 where t1.user_id =
t2.user_id group by t1.object_id, t3.user_id )

Plan hash value: 3134842094

------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp|
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.37 | 139 | 2715 | 2715 | | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.37 | 139 | 2715 | 2715 | | | | |
| 2 | VIEW | | 1 | 1 | 810K|00:00:00.62 | 139 | 2715 | 2715 | | | | |
| 3 | HASH GROUP BY | | 1 | 1 | 810K|00:00:00.62 | 139 | 2715 | 2715 | 35M| 5521K| 7010K (1)| 23552 |
|* 4 | HASH JOIN | | 1 | 1 | 810K|00:00:00.01 | 139 | 0 | 0 | 1348K| 1348K| 1167K (0)| |
| 5 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | 0 | 0 | | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | 0 | 0 | | | | |
| 7 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | 0 | 0 | | | | |
| 8 | TABLE ACCESS FULL | T1 | 1 | 29907 | 30000 |00:00:00.01 | 55 | 0 | 0 | | | | |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."USER_ID"="T2"."USER_ID")

28 rows selected.

Elapsed: 00:00:00.03
SQL>

As can be seen, with an estimate in the right ballpark the HASH GROUP BY operation completes in memory, using very close to the maximum size of a single workarea, which from 10.2 on with automatic PGA management is 20% of PGA_AGGREGATE_TARGET for settings below 500M - here 20% of 200M, i.e. roughly 40M (for more information see Joze Senegacnik's paper on the internals of automatic PGA management).

However, when exactly the same operation is repeated with the fudged statistics, it spills to disk and uses only about 7M, although it could have used up to 40M (given that there is no concurrent workload).
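If you want to watch what a workarea does while such a statement is running, you can query V$SQL_WORKAREA_ACTIVE from a second session. The following is just a minimal sketch restricted to columns I'm reasonably confident about - adjust the column list and filters to your needs:

select
       sid
     , operation_type
     , policy
     , round(expected_size / 1024 / 1024)    as expected_mb
     , round(actual_mem_used / 1024 / 1024)  as actual_mb
     , round(max_mem_used / 1024 / 1024)     as max_mb
     , number_passes
     , round(tempseg_size / 1024 / 1024)     as tempseg_mb
from
       v$sql_workarea_active
order by
       sid;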

The same does not happen when repeating this experiment with other operations that use a workarea - the most obvious one being a SORT GROUP BY, as can be seen from this output:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> show parameter pga

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 200M
SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 0
db_writer_processes integer 1
gcs_server_processes integer 0
job_queue_processes integer 10
log_archive_max_processes integer 2
processes integer 150
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:00.04
SQL>
SQL> drop table t2 purge;

Table dropped.

Elapsed: 00:00:00.04
SQL>
SQL> create table t1
2 as
3 select mod(rownum, 15) as user_id, rownum as object_id from dual connect by level <= 30000;

Table created.

Elapsed: 00:00:00.06
SQL>
SQL> create table t2
2 as
3 select distinct user_id from t1;

Table created.

Elapsed: 00:00:00.03
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.09
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> column low_val new_value low_value
SQL> column high_val new_value high_value
SQL>
SQL> select
2 max(user_id) + 1 as low_val
3 , max(user_id) + max(user_id) - min(user_id) + 1 as high_val
4 from
5 t2;

LOW_VAL HIGH_VAL
---------- ----------
15 29

Elapsed: 00:00:00.00
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> -- alter session set tracefile_identifier = 'smm_trace_correct_cardinality';
SQL>
SQL> -- alter session set "_smm_trace" = 65535;
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 no_use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:00.54
SQL>
SQL> -- alter session set "_smm_trace" = 0;
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 94tqnspjuzk8x, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner
t2 t3 t1) use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner
t1) no_use_hash_aggregation(@inner) */ max(cnt) from ( select /*+ qb_name(inner) */
count(*) as cnt, t1.object_id, t3.user_id from t1, t2, t2 t3 where t1.user_id = t2.user_id group by
t1.object_id, t3.user_id )

Plan hash value: 2068941295

--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.53 | 103 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.53 | 103 | | | |
| 2 | VIEW | | 1 | 315K| 450K|00:00:00.37 | 103 | | | |
| 3 | SORT GROUP BY | | 1 | 315K| 450K|00:00:00.37 | 103 | 28M| 1913K| 25M (0)|
|* 4 | HASH JOIN | | 1 | 446K| 450K|00:00:00.01 | 103 | 1348K| 1348K| 1097K (0)|
| 5 | NESTED LOOPS | | 1 | 225 | 225 |00:00:00.01 | 48 | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 15 | 15 |00:00:00.01 | 3 | | | |
| 7 | TABLE ACCESS FULL| T2 | 15 | 15 | 225 |00:00:00.01 | 45 | | | |
| 8 | TABLE ACCESS FULL | T1 | 1 | 29787 | 30000 |00:00:00.01 | 55 | | | |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."USER_ID"="T2"."USER_ID")

29 rows selected.

Elapsed: 00:00:00.03
SQL>
SQL> set pagesize 14
SQL>
SQL> -- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
SQL> declare
2 srec dbms_stats.statrec;
3 novals dbms_stats.numarray;
4 distcnt number;
5 avgclen number;
6 nullcnt number;
7 density number;
8 srec2 dbms_stats.statrec;
9 begin
10 dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
11 srec2.epc := 2;
12 novals := dbms_stats.numarray(
13 &low_value,
14 &high_value
15 );
16 srec2.bkvals := null;
17 dbms_stats.prepare_column_values(srec2,novals);
18 dbms_stats.set_column_stats(
19 ownname=>null,
20 tabname=>'t1',
21 colname=>'user_id',
22 distcnt=>distcnt,
23 nullcnt=>nullcnt,
24 srec=>srec2,
25 avgclen=>avgclen,
26 density=>density
27 );
28 end;
29 /
old 13: &low_value,
new 13: 15,
old 14: &high_value
new 14: 29

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> -- alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';
SQL>
SQL> -- alter session set "_smm_trace" = 65535;
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 no_use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:00.56
SQL>
SQL> -- alter session set "_smm_trace" = 0;
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 94tqnspjuzk8x, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner
t2 t3 t1) use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner
t1) no_use_hash_aggregation(@inner) */ max(cnt) from ( select /*+ qb_name(inner) */
count(*) as cnt, t1.object_id, t3.user_id from t1, t2, t2 t3 where t1.user_id = t2.user_id group by
t1.object_id, t3.user_id )

Plan hash value: 2068941295

--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.55 | 103 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.55 | 103 | | | |
| 2 | VIEW | | 1 | 1 | 450K|00:00:00.38 | 103 | | | |
| 3 | SORT GROUP BY | | 1 | 1 | 450K|00:00:00.38 | 103 | 28M| 1913K| 25M (0)|
|* 4 | HASH JOIN | | 1 | 1 | 450K|00:00:00.01 | 103 | 1348K| 1348K| 1053K (0)|
| 5 | NESTED LOOPS | | 1 | 225 | 225 |00:00:00.01 | 48 | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 15 | 15 |00:00:00.01 | 3 | | | |
| 7 | TABLE ACCESS FULL| T2 | 15 | 15 | 225 |00:00:00.01 | 45 | | | |
| 8 | TABLE ACCESS FULL | T1 | 1 | 29787 | 30000 |00:00:00.01 | 55 | | | |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."USER_ID"="T2"."USER_ID")

29 rows selected.

Elapsed: 00:00:00.03
SQL>

All I've done is to reduce the data set - because the SORT GROUP BY in this case requires more memory for the same amount of data than the HASH GROUP BY, this prevents the operation from spilling to disk with a 200M PGA_AGGREGATE_TARGET setting - and to change the USE_HASH_AGGREGATION hint to NO_USE_HASH_AGGREGATION.

As you can see, the operation completes both times in memory and uses the same amount of memory no matter what the estimates look like.

I've tested the serial execution of the most common workarea based operations like HASH JOIN, SORT ORDER BY, WINDOW SORT, SORT UNIQUE and all of them were able to dynamically resize the workarea in cases where the initial estimated size was too small.

If you check carefully you'll notice that I haven't mentioned the HASH UNIQUE operation yet - later on you'll see why.

A cunning feature of automatic PGA management comes to the rescue, however: a kind of "feedback loop" for the workareas based on statistics maintained by the automatic PGA memory management. Indeed, when the HASH GROUP BY cursor based on the incorrect cardinality estimate gets shared (not invalidated) and re-executed, the next execution looks like this:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_wrong_cardinality_repeated_execution';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:00.76
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 2s1sdfhvhajv3, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner t1)
use_hash_aggregation(@inner) */ max(cnt) from ( select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id,
t3.user_id from t1, t2, t2 t3 where t1.user_id = t2.user_id group by t1.object_id, t3.user_id )

Plan hash value: 3134842094

------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | Used-Tmp|
------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.75 | 139 | | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.75 | 139 | | | | |
| 2 | VIEW | | 1 | 1 | 810K|00:00:00.44 | 139 | | | | |
| 3 | HASH GROUP BY | | 1 | 1 | 810K|00:00:00.44 | 139 | 35M| 5521K| 38M (0)| |
|* 4 | HASH JOIN | | 1 | 1 | 810K|00:00:00.01 | 139 | 1348K| 1348K| 1412K (0)| |
| 5 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | | | | |
| 7 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | | | | |
| 8 | TABLE ACCESS FULL | T1 | 1 | 29836 | 30000 |00:00:00.01 | 55 | | | | |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."USER_ID"="T2"."USER_ID")

28 rows selected.

Elapsed: 00:00:00.03
SQL>

So Oracle this time based the workarea memory requirements on the feedback from the previous execution and therefore allocated sufficient memory to complete the operation without spilling to disk.
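The per-cursor workarea statistics that this feedback is based on can be inspected via V$SQL_WORKAREA. A minimal sketch - the SQL_ID used here is the one from this test case, so substitute your own:

select
       operation_type
     , operation_id
     , policy
     , round(estimated_optimal_size / 1024) as est_optimal_kb
     , round(last_memory_used / 1024)       as last_mem_kb
     , last_execution
     , optimal_executions
     , onepass_executions
     , multipasses_executions
from
       v$sql_workarea
where
       sql_id = '2s1sdfhvhajv3';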

Notice however that the feedback loop unfortunately does not work as desired when the workarea execution fails due to insufficient TEMP space. Ideally the feedback loop should allow the subsequent executions to grab more memory specifically in such cases, but at present it doesn't - the failure obviously prevents an update of the statistics and therefore subsequent executions continue to fail since they still use the same amount of memory.

You can test this by simply assigning a very small TEMP tablespace to the user executing the query, so that the second execution fails due to insufficient TEMP space. If you then repeat the execution, it will fail again and keep doing so.
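For reference, a minimal sketch of how such a setup could look - the tablespace name, the datafile path and the user name are made up for illustration and obviously need to be adapted:

-- A deliberately tiny TEMP tablespace with autoextend disabled (name and path are just examples)
create temporary tablespace temp_tiny
tempfile '/oradata/orcl/temp_tiny01.dbf' size 2m autoextend off;

-- Assign it to the (hypothetical) test user so that the spilling workarea runs out of TEMP space
alter user cbo_test temporary tablespace temp_tiny;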

What happens if I invalidate the cursor and repeat the execution?

SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.01
SQL> set echo on timing on
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_wrong_cardinality_repeated_exec_invalid';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;
1

Elapsed: 00:00:01.29
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 2s1sdfhvhajv3, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner t1) use_hash_aggregation(@inner)
*/ max(cnt) from ( select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id from t1, t2, t2 t3 where t1.user_id =
t2.user_id group by t1.object_id, t3.user_id )

Plan hash value: 3134842094

------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp|
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.28 | 139 | 2715 | 2715 | | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.28 | 139 | 2715 | 2715 | | | | |
| 2 | VIEW | | 1 | 1 | 810K|00:00:00.61 | 139 | 2715 | 2715 | | | | |
| 3 | HASH GROUP BY | | 1 | 1 | 810K|00:00:00.61 | 139 | 2715 | 2715 | 35M| 5521K| 7067K (1)| 23552 |
|* 4 | HASH JOIN | | 1 | 1 | 810K|00:00:00.01 | 139 | 0 | 0 | 1348K| 1348K| 1129K (0)| |
| 5 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | 0 | 0 | | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | 0 | 0 | | | | |
| 7 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | 0 | 0 | | | | |
| 8 | TABLE ACCESS FULL | T1 | 1 | 29836 | 30000 |00:00:00.01 | 55 | 0 | 0 | | | | |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("T1"."USER_ID"="T2"."USER_ID")

28 rows selected.

Elapsed: 00:00:00.03
SQL>

Back to square one - with the invalidation the statistics are gone, too, so the bad cardinality estimate again led to the suboptimal execution of the HASH GROUP BY.

So you might think: why bother? Only the first execution of a cursor without workarea execution statistics is affected by the problem; subsequent executions of the same cursor will benefit from the statistics of the previous executions.

The problem, however, is that this only helps applications that share cursors. Unfortunately, applications that perform heavy-duty aggregations - like data warehouses - typically do not share cursors, since they do not care about the optimization overhead and deliberately use literals to provide as much information to the optimizer as possible.

Also, these heavy-duty aggregations usually make use of Parallel Execution features, and as you can see from the output of the same test case - this time running the HASH GROUP BY in parallel by simply setting table T1 to parallel degree 2 - a similar problem occurs, so Parallel Execution is affected as well.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> show parameter pga

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 200M
SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 0
db_writer_processes integer 1
gcs_server_processes integer 0
job_queue_processes integer 10
log_archive_max_processes integer 2
processes integer 150
SQL>
SQL> -- alter session set workarea_size_policy = manual sort_area_size = 40000000;
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:00.01
SQL>
SQL> drop table t2 purge;

Table dropped.

Elapsed: 00:00:00.04
SQL>
SQL> create table t1
2 as
3 select mod(rownum, 27) as user_id, rownum as object_id from dual connect by level <= 30000;

Table created.

Elapsed: 00:00:00.06
SQL>
SQL> alter table t1 parallel 2;

Table altered.

Elapsed: 00:00:00.03
SQL>
SQL> create table t2
2 as
3 select distinct user_id from t1;

Table created.

Elapsed: 00:00:00.03
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> column low_val new_value low_value
SQL> column high_val new_value high_value
SQL>
SQL> select
2 max(user_id) + 1 as low_val
3 , max(user_id) + max(user_id) - min(user_id) + 1 as high_val
4 from
5 t2;

LOW_VAL HIGH_VAL
---------- ----------
27 53

Elapsed: 00:00:00.00
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_correct_cardinality';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:00.42
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS'));
SQL_ID 2s1sdfhvhajv3, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_hash(@inner t1) no_swap_join_inputs(@inner t1)
use_hash_aggregation(@inner) */ max(cnt) from ( select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id,
t3.user_id from t1, t2, t2 t3 where t1.user_id = t2.user_id group by t1.object_id, t3.user_id )

Plan hash value: 464898991

--------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | O/1/M |
--------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.43 | 87 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.43 | 87 | | | |
| 2 | PX COORDINATOR | | 1 | | 2 |00:00:00.43 | 87 | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10002 | 0 | 1 | 0 |00:00:00.01 | 0 | | | |
| 4 | SORT AGGREGATE | | 2 | 1 | 2 |00:00:00.81 | 0 | | | |
| 5 | VIEW | | 2 | 575K| 810K|00:00:00.48 | 0 | | | |
| 6 | HASH GROUP BY | | 2 | 575K| 810K|00:00:00.48 | 0 | 35M| 5521K| 2/0/0|
| 7 | PX RECEIVE | | 2 | 814K| 810K|00:00:00.01 | 0 | | | |
| 8 | PX SEND HASH | :TQ10001 | 0 | 814K| 0 |00:00:00.01 | 0 | | | |
|* 9 | HASH JOIN | | 2 | 814K| 810K|00:00:00.01 | 89 | 1348K| 1348K| 2/0/0|
| 10 | BUFFER SORT | | 2 | | 1458 |00:00:00.01 | 0 | 29696 | 29696 | 2/0/0|
| 11 | PX RECEIVE | | 2 | 729 | 1458 |00:00:00.01 | 0 | | | |
| 12 | PX SEND BROADCAST | :TQ10000 | 0 | 729 | 0 |00:00:00.01 | 0 | | | |
| 13 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | | | |
| 14 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | | | |
| 15 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | | | |
| 16 | PX BLOCK ITERATOR | | 2 | 30164 | 30000 |00:00:00.01 | 89 | | | |
|* 17 | TABLE ACCESS FULL | T1 | 18 | 30164 | 30000 |00:00:00.01 | 89 | | | |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

9 - access("T1"."USER_ID"="T2"."USER_ID")
17 - access(:Z>=:Z AND :Z<=:Z)

38 rows selected.

Elapsed: 00:00:00.03
SQL>
SQL> set pagesize 14
SQL>
SQL> -- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
SQL> declare
2 srec dbms_stats.statrec;
3 novals dbms_stats.numarray;
4 distcnt number;
5 avgclen number;
6 nullcnt number;
7 density number;
8 srec2 dbms_stats.statrec;
9 begin
10 dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
11 srec2.epc := 2;
12 novals := dbms_stats.numarray(
13 &low_value,
14 &high_value
15 );
16 srec2.bkvals := null;
17 dbms_stats.prepare_column_values(srec2,novals);
18 dbms_stats.set_column_stats(
19 ownname=>null,
20 tabname=>'t1',
21 colname=>'user_id',
22 distcnt=>distcnt,
23 nullcnt=>nullcnt,
24 srec=>srec2,
25 avgclen=>avgclen,
26 density=>density
27 );
28 end;
29 /
old 13: &low_value,
new 13: 27,
old 14: &high_value
new 14: 53

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.00
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_hash(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 use_hash_aggregation(@inner)
9 */
10 max(cnt)
11 from
12 (
13 select /*+ qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 group by t1.object_id, t3.user_id
17 )
18 ;

MAX(CNT)
----------
1

Elapsed: 00:00:01.17
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS'));
SQL_ID 2s1sdfhvhajv3, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2) full(@inner t3) leading(@inner t2 t3 t1) use_nl(@inner t3)
use_hash(@inner t1) no_swap_join_inputs(@inner t1) use_hash_aggregation(@inner) */ max(cnt) from ( select /*+
qb_name(inner) */ count(*) as cnt, t1.object_id, t3.user_id from t1, t2, t2 t3 where t1.user_id = t2.user_id group by t1.object_id, t3.user_id )

Plan hash value: 464898991

------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | O/1/M | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.16 | 87 | 0 | 0 | | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.16 | 87 | 0 | 0 | | | | |
| 2 | PX COORDINATOR | | 1 | | 2 |00:00:01.16 | 87 | 0 | 0 | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10002 | 0 | 1 | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 4 | SORT AGGREGATE | | 2 | 1 | 2 |00:00:02.25 | 0 | 2460 | 2460 | | | | |
| 5 | VIEW | | 2 | 1 | 810K|00:00:01.53 | 0 | 2460 | 2460 | | | | |
| 6 | HASH GROUP BY | | 2 | 1 | 810K|00:00:01.53 | 0 | 2460 | 2460 | 35M| 5521K| | 11264 |
| 7 | PX RECEIVE | | 2 | 1 | 810K|00:00:00.01 | 0 | 0 | 0 | | | | |
| 8 | PX SEND HASH | :TQ10001 | 0 | 1 | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
|* 9 | HASH JOIN | | 2 | 1 | 810K|00:00:00.01 | 89 | 0 | 0 | 1348K| 1348K| 2/0/0| |
| 10 | BUFFER SORT | | 2 | | 1458 |00:00:00.01 | 0 | 0 | 0 | 29696 | 29696 | 2/0/0| |
| 11 | PX RECEIVE | | 2 | 729 | 1458 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 12 | PX SEND BROADCAST | :TQ10000 | 0 | 729 | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 13 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | 0 | 0 | | | | |
| 14 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | 0 | 0 | | | | |
| 15 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | 0 | 0 | | | | |
| 16 | PX BLOCK ITERATOR | | 2 | 30164 | 30000 |00:00:00.01 | 89 | 0 | 0 | | | | |
|* 17 | TABLE ACCESS FULL | T1 | 18 | 30164 | 30000 |00:00:00.01 | 89 | 0 | 0 | | | | |
------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

9 - access("T1"."USER_ID"="T2"."USER_ID")
17 - access(:Z>=:Z AND :Z<=:Z)

37 rows selected.

Elapsed: 00:00:00.04
SQL>

So, if your database uses automatic PGA management and

- uses the hash aggregation HASH GROUP BY (or HASH UNIQUE, more on that in a moment)
- and does not share cursors

every execution that is based on bad cardinality estimates potentially has a problem with the hash aggregation because it might not make efficient use of the available memory.

The same applies to applications that share cursors; however, in that case only the first execution after re-optimization / invalidation is affected.

So you might want to carefully check the runtime execution statistics of your critical hash aggregations.

Mind you, things could be worse, and that is where the HASH UNIQUE operation comes into the picture.

When I realised the issue with the HASH GROUP BY operation I was quite certain that the HASH UNIQUE operation would be affected in a similar way, since internally Oracle seems to use the same mechanism for both operations (in the SMM trace files both are called HASH GROUP BY).

To my surprise I noticed that in 10g versions below 10.2.0.5 and 11g versions below 11.2.0.1 (which means in this case 10.2.0.4 and 11.1.0.7 respectively, I didn't test other versions) the HASH UNIQUE operation suffers from an even more dramatic problem: The cardinality estimate is not considered and the initial workarea size is always based on minimum assumptions.

In passing: according to the SMM trace files that can be activated using the undocumented "_smm_trace" parameter, it looks like many of the sort-based workareas, such as SORT ORDER BY or WINDOW SORT, suffer from the same defect. Since they are able to dynamically resize and make use of the workarea statistics feedback, they effectively work as expected - they just start with a workarea size estimate that is too low every time they are executed for the first time after re-optimization / invalidation.

The combination of the two issues - inability to dynamically resize and ignoring the optimizer estimates - leads to a dire result: Every first execution of a HASH UNIQUE operation in those versions will only use the minimum amount of memory. The following test case shows the problem:

set echo on timing on

show parameter pga
show parameter processes

-- alter session set workarea_size_policy = manual sort_area_size = 40000000;

drop table t1 purge;

drop table t2 purge;

create table t1
as
select mod(rownum, 27) as user_id, rownum as object_id from dual connect by level <= 30000;

create table t2
as
select distinct user_id from t1;

exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

column low_val new_value low_value
column high_val new_value high_value

select
max(user_id) + 1 as low_val
, max(user_id) + max(user_id) - min(user_id) + 1 as high_val
from
t2;

-- Invalidate any cursors using T1
comment on table t1 is '';

alter session set statistics_level = all;

alter session set tracefile_identifier = 'smm_trace_correct_cardinality';

alter session set "_smm_trace" = 65535;

select /*+ full(@inner t1)
full(@inner t2)
full(@inner t3)
leading(@inner t2 t3 t1)
use_nl(@inner t3)
use_nl(@inner t1)
use_hash_aggregation(@inner)
*/
max(user_id)
from
(
select /*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id
from t1, t2, t2 t3
where t1.user_id = t2.user_id
)
;

alter session set "_smm_trace" = 0;

set pagesize 0 linesize 200 trimspool on tab off

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

set pagesize 14

-- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
declare
srec dbms_stats.statrec;
novals dbms_stats.numarray;
distcnt number;
avgclen number;
nullcnt number;
density number;
srec2 dbms_stats.statrec;
begin
dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
srec2.epc := 2;
novals := dbms_stats.numarray(
&low_value,
&high_value
);
srec2.bkvals := null;
dbms_stats.prepare_column_values(srec2,novals);
dbms_stats.set_column_stats(
ownname=>null,
tabname=>'t1',
colname=>'user_id',
distcnt=>distcnt,
nullcnt=>nullcnt,
srec=>srec2,
avgclen=>avgclen,
density=>density
);
end;
/

-- Invalidate any cursors using T1
comment on table t1 is '';

alter session set statistics_level = all;

alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';

alter session set "_smm_trace" = 65535;

select /*+ full(@inner t1)
full(@inner t2)
full(@inner t3)
leading(@inner t2 t3 t1)
use_nl(@inner t3)
use_nl(@inner t1)
use_hash_aggregation(@inner)
*/
max(user_id)
from
(
select /*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id
from t1, t2, t2 t3
where t1.user_id = t2.user_id
)
;

alter session set "_smm_trace" = 0;

set pagesize 0 linesize 200 trimspool on tab off

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

Obviously it is very similar to the HASH GROUP BY test case; this time, however, I've used a DISTINCT clause and replaced the HASH JOIN with a NESTED LOOPS join, which makes the generated trace files easier to read since there is exactly one workarea involved in this execution.

And this is what I get from 11.1.0.7, this time using AMM, and therefore with PGA_AGGREGATE_TARGET set to 0 (you'll get the same result from 10.2.0.4 with a corresponding P_A_T setting):

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> show parameter pga

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 0
SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 0
db_writer_processes integer 1
gcs_server_processes integer 0
global_txn_processes integer 1
job_queue_processes integer 1000
log_archive_max_processes integer 4
processes integer 150
SQL>
SQL> -- alter session set workarea_size_policy = manual sort_area_size = 40000000;
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:00.01
SQL>
SQL> drop table t2 purge;

Table dropped.

Elapsed: 00:00:00.00
SQL>
SQL> create table t1
2 as
3 select mod(rownum, 27) as user_id, rownum as object_id from dual connect by level <= 30000;

Table created.

Elapsed: 00:00:00.04
SQL>
SQL> create table t2
2 as
3 select distinct user_id from t1;

Table created.

Elapsed: 00:00:00.04
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.06
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL>
SQL> column low_val new_value low_value
SQL> column high_val new_value high_value
SQL>
SQL> select
2 max(user_id) + 1 as low_val
3 , max(user_id) + max(user_id) - min(user_id) + 1 as high_val
4 from
5 t2;

LOW_VAL HIGH_VAL
---------- ----------
27 53

Elapsed: 00:00:00.00
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_correct_cardinality';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_nl(@inner t1)
7 use_hash_aggregation(@inner)
8 */
9 max(user_id)
10 from
11 (
12 select /*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id
13 from t1, t2, t2 t3
14 where t1.user_id = t2.user_id
15 )
16 ;

MAX(USER_ID)
------------
26

Elapsed: 00:00:01.79
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 7cnqz02uwcs1a, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2)
full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_nl(@inner t1)
use_hash_aggregation(@inner) */ max(user_id) from ( select /*+
qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id from t1,
t2, t2 t3 where t1.user_id = t2.user_id )

Plan hash value: 1541846686

------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp|
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.78 | 40179 | 1470 | 1470 | | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.78 | 40179 | 1470 | 1470 | | | | |
| 2 | VIEW | | 1 | 572K| 810K|00:00:00.97 | 40179 | 1470 | 1470 | | | | |
| 3 | HASH UNIQUE | | 1 | 572K| 810K|00:00:00.97 | 40179 | 1470 | 1470 | 25M| 3296K| 6029K (1)| 13312 |
| 4 | NESTED LOOPS | | 1 | 810K| 810K|00:00:00.01 | 40179 | 0 | 0 | | | | |
| 5 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | 0 | 0 | | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | 0 | 0 | | | | |
| 7 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | 0 | 0 | | | | |
|* 8 | TABLE ACCESS FULL | T1 | 729 | 1111 | 810K|00:00:00.03 | 40095 | 0 | 0 | | | | |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

8 - filter("T1"."USER_ID"="T2"."USER_ID")

30 rows selected.

Elapsed: 00:00:00.03
SQL>
SQL> set pagesize 14
SQL>
SQL> -- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
SQL> declare
2 srec dbms_stats.statrec;
3 novals dbms_stats.numarray;
4 distcnt number;
5 avgclen number;
6 nullcnt number;
7 density number;
8 srec2 dbms_stats.statrec;
9 begin
10 dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
11 srec2.epc := 2;
12 novals := dbms_stats.numarray(
13 &low_value,
14 &high_value
15 );
16 srec2.bkvals := null;
17 dbms_stats.prepare_column_values(srec2,novals);
18 dbms_stats.set_column_stats(
19 ownname=>null,
20 tabname=>'t1',
21 colname=>'user_id',
22 distcnt=>distcnt,
23 nullcnt=>nullcnt,
24 srec=>srec2,
25 avgclen=>avgclen,
26 density=>density
27 );
28 end;
29 /
old 13: &low_value,
new 13: 27,
old 14: &high_value
new 14: 53

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_nl(@inner t1)
7 use_hash_aggregation(@inner)
8 */
9 max(user_id)
10 from
11 (
12 select /*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id
13 from t1, t2, t2 t3
14 where t1.user_id = t2.user_id
15 )
16 ;

MAX(USER_ID)
------------
26

Elapsed: 00:00:01.26
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID 7cnqz02uwcs1a, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2)
full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_nl(@inner t1)
use_hash_aggregation(@inner) */ max(user_id) from ( select /*+
qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id from t1,
t2, t2 t3 where t1.user_id = t2.user_id )

Plan hash value: 1541846686

------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp|
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:01.26 | 40179 | 1470 | 1470 | | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:01.26 | 40179 | 1470 | 1470 | | | | |
| 2 | VIEW | | 1 | 1 | 810K|00:00:00.98 | 40179 | 1470 | 1470 | | | | |
| 3 | HASH UNIQUE | | 1 | 1 | 810K|00:00:00.98 | 40179 | 1470 | 1470 | 25M| 3296K| 6093K (1)| 13312 |
| 4 | NESTED LOOPS | | 1 | 1 | 810K|00:00:00.01 | 40179 | 0 | 0 | | | | |
| 5 | NESTED LOOPS | | 1 | 729 | 729 |00:00:00.01 | 84 | 0 | 0 | | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 27 | 27 |00:00:00.01 | 3 | 0 | 0 | | | | |
| 7 | TABLE ACCESS FULL| T2 | 27 | 27 | 729 |00:00:00.01 | 81 | 0 | 0 | | | | |
|* 8 | TABLE ACCESS FULL | T1 | 729 | 1 | 810K|00:00:00.01 | 40095 | 0 | 0 | | | | |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

8 - filter("T1"."USER_ID"="T2"."USER_ID")

30 rows selected.

Elapsed: 00:00:00.03
SQL>

As you can see, no matter what the estimates look like, on this system the first execution of a HASH UNIQUE will not get more than about 6M - subsequent executions of the same cursor benefit from the statistics of the previous run, as seen before.

Again, when switching to a SORT UNIQUE and reducing the data set accordingly, the problem cannot be reproduced:

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL>
SQL> show parameter pga

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 0
SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 0
db_writer_processes integer 1
gcs_server_processes integer 0
global_txn_processes integer 1
job_queue_processes integer 1000
log_archive_max_processes integer 4
processes integer 150
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:00.15
SQL>
SQL> drop table t2 purge;

Table dropped.

Elapsed: 00:00:00.07
SQL>
SQL> create table t1
2 as
3 select mod(rownum, 15) as user_id, rownum as object_id from dual connect by level <= 30000;

Table created.

Elapsed: 00:00:00.06
SQL>
SQL> create table t2
2 as
3 select distinct user_id from t1;

Table created.

Elapsed: 00:00:00.04
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't2', method_opt => 'for all columns size 1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> column low_val new_value low_value
SQL> column high_val new_value high_value
SQL>
SQL> select
2 max(user_id) + 1 as low_val
3 , max(user_id) + max(user_id) - min(user_id) + 1 as high_val
4 from
5 t2;

LOW_VAL HIGH_VAL
---------- ----------
15 29

Elapsed: 00:00:00.00
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_correct_cardinality';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_nl(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 no_use_hash_aggregation(@inner)
9 */
10 max(user_id)
11 from
12 (
13 select /*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 )
17 ;

MAX(USER_ID)
------------
14

Elapsed: 00:00:00.73
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID ckb2sbz3y2z14, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2)
full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_nl(@inner t1)
no_swap_join_inputs(@inner t1)
no_use_hash_aggregation(@inner) */ max(user_id) from ( select
/*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id from
t1, t2, t2 t3 where t1.user_id = t2.user_id )

Plan hash value: 3828303002

--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.71 | 12423 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.71 | 12423 | | | |
| 2 | VIEW | | 1 | 318K| 450K|00:00:00.62 | 12423 | | | |
| 3 | SORT UNIQUE | | 1 | 318K| 450K|00:00:00.62 | 12423 | 22M| 1744K| 20M (0)|
| 4 | NESTED LOOPS | | 1 | 450K| 450K|00:00:00.01 | 12423 | | | |
| 5 | NESTED LOOPS | | 1 | 225 | 225 |00:00:00.01 | 48 | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 15 | 15 |00:00:00.01 | 3 | | | |
| 7 | TABLE ACCESS FULL| T2 | 15 | 15 | 225 |00:00:00.01 | 45 | | | |
|* 8 | TABLE ACCESS FULL | T1 | 225 | 2000 | 450K|00:00:00.01 | 12375 | | | |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

8 - filter("T1"."USER_ID"="T2"."USER_ID")

31 rows selected.

Elapsed: 00:00:00.06
SQL>
SQL> set pagesize 14
SQL>
SQL> -- Fudge the statistics so that the join cardinality between t1 and t2 will be 0 (rounded up to 1)
SQL> declare
2 srec dbms_stats.statrec;
3 novals dbms_stats.numarray;
4 distcnt number;
5 avgclen number;
6 nullcnt number;
7 density number;
8 srec2 dbms_stats.statrec;
9 begin
10 dbms_stats.get_column_stats(null, 't1', 'user_id', distcnt => distcnt, avgclen => avgclen, nullcnt => nullcnt, density => density, srec => srec);
11 srec2.epc := 2;
12 novals := dbms_stats.numarray(
13 &low_value,
14 &high_value
15 );
16 srec2.bkvals := null;
17 dbms_stats.prepare_column_values(srec2,novals);
18 dbms_stats.set_column_stats(
19 ownname=>null,
20 tabname=>'t1',
21 colname=>'user_id',
22 distcnt=>distcnt,
23 nullcnt=>nullcnt,
24 srec=>srec2,
25 avgclen=>avgclen,
26 density=>density
27 );
28 end;
29 /
old 13: &low_value,
new 13: 15,
old 14: &high_value
new 14: 29

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.06
SQL>
SQL> -- Invalidate any cursors using T1
SQL> comment on table t1 is '';

Comment created.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set tracefile_identifier = 'smm_trace_wrong_cardinality';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set "_smm_trace" = 65535;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> select /*+ full(@inner t1)
2 full(@inner t2)
3 full(@inner t3)
4 leading(@inner t2 t3 t1)
5 use_nl(@inner t3)
6 use_nl(@inner t1)
7 no_swap_join_inputs(@inner t1)
8 no_use_hash_aggregation(@inner)
9 */
10 max(user_id)
11 from
12 (
13 select /*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id
14 from t1, t2, t2 t3
15 where t1.user_id = t2.user_id
16 )
17 ;

MAX(USER_ID)
------------
14

Elapsed: 00:00:00.74
SQL>
SQL> alter session set "_smm_trace" = 0;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set pagesize 0 linesize 200 trimspool on tab off
SQL>
SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
SQL_ID ckb2sbz3y2z14, child number 0
-------------------------------------
select /*+ full(@inner t1) full(@inner t2)
full(@inner t3) leading(@inner t2 t3 t1)
use_nl(@inner t3) use_nl(@inner t1)
no_swap_join_inputs(@inner t1)
no_use_hash_aggregation(@inner) */ max(user_id) from ( select
/*+ qb_name(inner) no_merge */ distinct t1.object_id, t3.user_id from
t1, t2, t2 t3 where t1.user_id = t2.user_id )

Plan hash value: 3828303002

--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.73 | 12423 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.73 | 12423 | | | |
| 2 | VIEW | | 1 | 1 | 450K|00:00:00.61 | 12423 | | | |
| 3 | SORT UNIQUE | | 1 | 1 | 450K|00:00:00.61 | 12423 | 22M| 1744K| 20M (0)|
| 4 | NESTED LOOPS | | 1 | 1 | 450K|00:00:00.01 | 12423 | | | |
| 5 | NESTED LOOPS | | 1 | 225 | 225 |00:00:00.01 | 48 | | | |
| 6 | TABLE ACCESS FULL| T2 | 1 | 15 | 15 |00:00:00.01 | 3 | | | |
| 7 | TABLE ACCESS FULL| T2 | 15 | 15 | 225 |00:00:00.01 | 45 | | | |
|* 8 | TABLE ACCESS FULL | T1 | 225 | 1 | 450K|00:00:00.01 | 12375 | | | |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

8 - filter("T1"."USER_ID"="T2"."USER_ID")

31 rows selected.

Elapsed: 00:00:00.01
SQL>
SQL> set doc off
SQL> doc
SQL>

In 10.2.0.5 and 11.2.0.1/11.2.0.2 this issue is apparently fixed and the cardinality estimates are used for the initial workarea size calculation. However, this doesn't mean that the dynamic resize problem is fixed - it simply means that the HASH UNIQUE in those versions behaves exactly like the HASH GROUP BY, and the memory usage of the first execution depends on the cardinality estimates.

Summary

If your database uses automatic PGA management, then for the hash aggregation operations HASH GROUP BY / HASH UNIQUE every initial execution that is based on bad cardinality estimates potentially has a problem, because it might not make efficient use of the available memory.

The same applies to applications that share cursors; however, in that case only the initial execution after re-optimization / invalidation is affected - subsequent executions benefit from the workarea statistics feedback mechanism.

Furthermore in 10g versions below 10.2.0.5 and 11g versions below 11.2.0.1 the initial execution of a HASH UNIQUE operation ignores the cardinality estimates and will always be based on minimum assumptions.

So you might want to carefully check the runtime execution statistics of your critical hash aggregations.
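One way of doing that is to look for hash aggregation workareas that had to spill to disk, for example with a query like the following against V$SQL_WORKAREA. This is only a sketch - the exact OPERATION_TYPE strings can differ between versions, so check what your version actually reports:

select
       sql_id
     , operation_type
     , policy
     , round(estimated_optimal_size / 1024 / 1024) as est_optimal_mb
     , round(last_memory_used / 1024 / 1024)       as last_mem_mb
     , round(last_tempseg_size / 1024 / 1024)      as last_temp_mb
     , optimal_executions
     , onepass_executions
     , multipasses_executions
from
       v$sql_workarea
where
       operation_type like 'GROUP BY%'
and    (onepass_executions > 0 or multipasses_executions > 0)
order by
       last_tempseg_size desc nulls last;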

Possible Workarounds

Different strategies are available as workarounds, depending on the situation:

- Upgrading to versions where the HASH UNIQUE operation at least considers the optimizer estimates might be beneficial

- Obviously good cardinality estimates are crucial - with those, and when using the correct Oracle versions, you should be fine. Using the undocumented OPT_ESTIMATE hint (or the deprecated, likewise undocumented CARDINALITY hint) might help in cases where other options, like manually crafted statistics, are not able to help the optimizer to come up with reasonable cardinality estimates.

- Applications that are able to share cursors might not be affected too much, thanks to the workarea execution statistics being re-used for subsequent executions of the same cursor

- The described problems disappear when switching to the manual workarea size policy and allowing for a sufficiently sized workarea. Interestingly, it is the less obvious SORT_AREA_SIZE that controls the HASH GROUP BY operation in manual mode, not the HASH_AREA_SIZE. Of course, using manual PGA memory management system-wide is strongly discouraged, so this might only be a workaround in certain cases of large batch jobs where manual workarea sizes are used anyway (a minimal example follows after this list). Also note that when switching to manual workareas you should be aware of a nasty bug that was introduced in the 10.2.0.3 patchset. For more information see MOS note 6053134.8 ("ALTER SESSION to set SORT_AREA_SIZE not honoured") and, for example, my and Jonathan Lewis' posts about the issue. According to the MOS note the bug has been fixed in 10.2.0.4.3, 10.2.0.5 and 11.1.0.7 / 11.2

- Obviously the hash aggregation can be avoided at statement level using the NO_USE_HASH_AGGREGATION hint, or at session / system level using the _GBY_HASH_AGGREGATION_ENABLED parameter (again, see the sketch below). Since the sort-based aggregation can be less efficient (but note that it doesn't have to be - it depends on the individual grouping and data pattern), this doesn't necessarily solve the problem either: the sort-based workarea might spill to disk where this wouldn't have been necessary had hash aggregation been used and worked as expected.
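Purely for illustration, here is a minimal sketch of what the session-level variants of the last two workarounds could look like. The SORT_AREA_SIZE value is an arbitrary example, the parameter involved is undocumented, and the statement-level NO_USE_HASH_AGGREGATION hint is already shown in the test cases above - so treat this as a starting point rather than a recommendation:

-- Disable hash aggregation at session level (undocumented parameter, use with care)
alter session set "_gby_hash_aggregation_enabled" = false;

-- Or switch to manual workarea sizing, e.g. for a large batch job.
-- Note that it is SORT_AREA_SIZE that controls the HASH GROUP BY workarea in manual mode,
-- and that on versions affected by bug 6053134 the ALTER SESSION might not be honoured.
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 100000000;
alter session set hash_area_size = 100000000;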

Final Note

Of course there are a lot of things that I haven't touched yet or only briefly, like Parallel Execution of the hash aggregations, workareas used for index sorts, and all the other details when the workarea spills to disk.

If you want to get an idea what kind of nasty things can happen in that case you might want to read Jonathan Lewis' post on analytic functions.

I also haven't yet tried to fiddle around with the undocumented _smm* related parameters to see if setting them to non-default values allows working around the issue.

And of course then there are bugs like this one: Bug 6817844 - Multi pass sort with auto memory management even with plenty of PGA [ID 6817844.8], which is also mentioned in Jonathan's post.

As a final note, the odd TEMP tablespace I/O pattern issue that Jonathan describes in his post (Bug 9041800) is marked as fixed in the 11.2.0.2 patch set, but I haven't tested this yet.

11.2 New Features - Subtle Restrictions

This is just a short note that some of the new features added in 11.2 have some subtle restrictions that are not documented properly.

1. Hybrid Columnar Compression and Function-Based Indexes

The Hybrid Columnar Compression (HCC) introduced in 11.2 - which in 11.2.0.1 is only enabled in tablespaces residing on Exadata Storage - has the subtle restriction that the mere presence of a function-based index, either explicitly created or created indirectly via an index on a virtual column, completely disables HCC on that segment for direct path / parallel DML inserts. The load silently falls back to no compression at all, rather than using for example basic or advanced OLTP compression instead.

This can be easily reproduced using the following simple test case:

set echo on timing on

spool hcc_function_based_indexes_testcase.log

drop table t_hcc_fbi purge;

-- Create table with HCC enabled
create table t_hcc_fbi compress for query low as select * from dba_objects where 1 = 2;

truncate table t_hcc_fbi;

-- Direct path inserts support HCC by default
insert /*+ append */ into t_hcc_fbi select * from dba_objects;

commit;

-- This is supposed to return DBMS_COMPRESSION.COMP_FOR_QUERY_LOW = 8
select distinct dbms_compression.get_compression_type(USER, 'T_HCC_FBI', rowid) from t_hcc_fbi;

-- So it is supported to have an FBI on HCC data
create index t_hcc_fbi_idx on t_hcc_fbi (lower(object_type));

truncate table t_hcc_fbi;

-- Let's make the FBI unusable, so it doesn't have to be maintained by the load
alter index t_hcc_fbi_idx unusable;

insert /*+ append */ into t_hcc_fbi select * from dba_objects;

commit;

-- But this will return DBMS_COMPRESSION.COMP_NOCOMPRESS = 1
select distinct dbms_compression.get_compression_type(USER, 'T_HCC_FBI', rowid) from t_hcc_fbi;

-- So the mere existence of a FBI, no matter if usable or unusable will prevent HCC with direct-path inserts / parallel DML
-- It is however fully supported to have them on HCC data, but you need to drop / re-create them in order to load the data
-- via direct-path inserts / parallel DML with HCC enabled

spool off

And this is the output I get from 11.2.0.1:

SQL>
SQL> drop table t_hcc_fbi purge;

Table dropped.

Elapsed: 00:00:00.13
SQL>
SQL> -- Create table with HCC enabled
SQL> create table t_hcc_fbi compress for query low as select * from dba_objects where 1 = 2;

Table created.

Elapsed: 00:00:00.09
SQL>
SQL> truncate table t_hcc_fbi;

Table truncated.

Elapsed: 00:00:00.01
SQL>
SQL> -- Direct path inserts support HCC by default
SQL> insert /*+ append */ into t_hcc_fbi select * from dba_objects;

72545 rows created.

Elapsed: 00:00:01.24
SQL>
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL>
SQL> -- This is supposed to return DBMS_COMPRESSION.COMP_FOR_QUERY_LOW = 8
SQL> select distinct dbms_compression.get_compression_type(USER, 'T_HCC_FBI', rowid) from t_hcc_fbi;

DBMS_COMPRESSION.GET_COMPRESSION_TYPE(USER,'T_HCC_FBI',ROWID)
-------------------------------------------------------------
8

Elapsed: 00:00:09.51
SQL>
SQL> -- So it is supported to have an FBI on HCC data
SQL> create index t_hcc_fbi_idx on t_hcc_fbi (lower(object_type));

Index created.

Elapsed: 00:00:00.23
SQL>
SQL> truncate table t_hcc_fbi;

Table truncated.

Elapsed: 00:00:00.03
SQL>
SQL> -- Let's make the FBI unusable, so it doesn't have to be maintained by the load
SQL> alter index t_hcc_fbi_idx unusable;

Index altered.

Elapsed: 00:00:00.02
SQL>
SQL> insert /*+ append */ into t_hcc_fbi select * from dba_objects;

72546 rows created.

Elapsed: 00:00:01.01
SQL>
SQL> commit;

Commit complete.

Elapsed: 00:00:00.01
SQL>
SQL> -- But this will return DBMS_COMPRESSION.COMP_NOCOMPRESS = 1
SQL> select distinct dbms_compression.get_compression_type(USER, 'T_HCC_FBI', rowid) from t_hcc_fbi;

DBMS_COMPRESSION.GET_COMPRESSION_TYPE(USER,'T_HCC_FBI',ROWID)
-------------------------------------------------------------
1

Elapsed: 00:00:08.93
SQL>

I've been told that this restriction is going to be lifted in some future version, but I haven't had the chance yet to test this with the 11.2.0.2 patch set, in particular because the patch set is not yet released for existing Exadata installations as upgrade - it is only available for new Exadata deployments.

2. Deferred Segment Creation and Parallel DML

11.2 introduced deferred segment creation, which interestingly is enabled by default in a fresh installation. I haven't done a migration from previous versions to 11.2 yet, so I don't know whether it is also enabled in migrated environments, but I assume so.

Note that the 11.2.0.2 patch set extends the deferred segment creation to partitioned objects which wasn't supported in the initial 11.2.0.1 release.

Deferred segment creation has some subtle side effects. For example, sequences seem not to start with the defined START WITH value: the recursive transaction of the first insert fails initially, the segment gets created, and the next sequence value is then used for the actual DML insert operation.
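A small sketch that illustrates the described sequence behaviour on 11.2.0.1 (all names made up):

create sequence seq_demo start with 1 increment by 1;

create table t_deferred_seq (id number)
segment creation deferred;

-- The first insert triggers the recursive segment creation; since the recursive
-- transaction initially fails, the sequence is incremented twice and the first
-- row ends up with ID = 2 instead of the defined START WITH value of 1
insert into t_deferred_seq (id) values (seq_demo.nextval);

commit;

select id from t_deferred_seq;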

Another side effect that is not properly documented is the fact that parallel DML is not supported on "segment-less" objects.

So if you have a freshly created, still segment-less table that is supposed to be populated via parallel DML inserts, the load will silently fall back to a serial direct-path insert the first time it is executed.

Once the segment has been created, parallel DML is used from then on, which adds to the confusion since the issue is only reproducible under particular circumstances.

So if you have migrated an application to 11.2 and wonder why sometimes load operations take significantly longer than before but mostly do not then this might be a viable explanation.

This behaviour is not part of the deferred segment creation restrictions described in the official documentation, but it is mentioned towards the end of the MOS note "11.2 Database New Feature Deferred Segment Creation [ID 887962.1]".

According to the mentioned MOS note this still seems to apply to 11.2.0.2, but I haven't tested this yet myself. So it looks like this significant restriction is not yet lifted with the current patch set.

Update July 2011: See comment below by Radoslav, the problem seems to be fixed in 11.2.0.2.

Here is a simple test case demonstrating the issue:

set echo on timing on

set linesize 160 trimspool on tab off pagesize 0

spool deferred_segment_creation_parallel_dml_testcase.log

drop table t_source purge;

drop table t_deferred_parallel_dml purge;

-- Create a simple table as source for the insert
create table t_source parallel nologging
as
select * from dba_objects;

exec dbms_stats.gather_table_stats(null, 't_source')

-- No AUTO DOP for this session to prevent any side-effects of this feature
alter session set parallel_degree_policy = manual;

-- Note that even a CTAS operation can be segment-less
-- if the query doesn't return any rows
create table t_deferred_parallel_dml
segment creation deferred
as
select * from t_source where 1 = 2;

-- No segment although CTAS has been used
select count(*) from user_segments where segment_name = 'T_DEFERRED_PARALLEL_DML';

alter session enable parallel dml;

explain plan for
insert /*+ append parallel(t) */ into t_deferred_parallel_dml t select * from t_source;

-- This plan does not show parallel DML
select * from table(dbms_xplan.display);

drop table t_deferred_parallel_dml purge;

create table t_deferred_parallel_dml
segment creation immediate
as
select * from t_source where 1 = 2;

-- One segment
select count(*) from user_segments where segment_name = 'T_DEFERRED_PARALLEL_DML';

explain plan for
insert /*+ append parallel(t) */ into t_deferred_parallel_dml t select * from t_source;

-- This plan does show parallel DML
select * from table(dbms_xplan.display);

spool off

And this is the output I get from 11.2.0.1:

SQL>
SQL> drop table t_source purge;

Table dropped.

Elapsed: 00:00:00.04
SQL>
SQL> drop table t_deferred_parallel_dml purge;

Table dropped.

Elapsed: 00:00:00.01
SQL>
SQL> -- Create a simple table as source for the insert
SQL> create table t_source parallel nologging
2 as
3 select * from dba_objects;

Table created.

Elapsed: 00:00:03.32
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't_source')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.64
SQL>
SQL> -- No AUTO DOP for this session to prevent any side-effects of this feature
SQL> alter session set parallel_degree_policy = manual;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> -- Note that even a CTAS operation can be segment-less
SQL> -- if the query doesn't return any rows
SQL> create table t_deferred_parallel_dml
2 segment creation deferred
3 as
4 select * from t_source where 1 = 2;

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> -- No segment although CTAS has been used
SQL> select count(*) from user_segments where segment_name = 'T_DEFERRED_PARALLEL_DML';
0

Elapsed: 00:00:00.00
SQL>
SQL> alter session enable parallel dml;

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> explain plan for
2 insert /*+ append parallel(t) */ into t_deferred_parallel_dml t select * from t_source;

Explained.

Elapsed: 00:00:00.01
SQL>
SQL> -- This plan does not show parallel DML
SQL> select * from table(dbms_xplan.display);
Plan hash value: 1453990251

------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 73251 | 6938K| 117 (1)| 00:00:02 | | | |
| 1 | LOAD AS SELECT | T_DEFERRED_PARALLEL_DML | | | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM)| :TQ10000 | 73251 | 6938K| 117 (1)| 00:00:02 | Q1,00 | P->S | QC (RAND) |
| 4 | PX BLOCK ITERATOR | | 73251 | 6938K| 117 (1)| 00:00:02 | Q1,00 | PCWC | |
| 5 | TABLE ACCESS FULL| T_SOURCE | 73251 | 6938K| 117 (1)| 00:00:02 | Q1,00 | PCWP | |
------------------------------------------------------------------------------------------------------------------------------

12 rows selected.

Elapsed: 00:00:00.06
SQL>
SQL> drop table t_deferred_parallel_dml purge;

Table dropped.

Elapsed: 00:00:00.03
SQL>
SQL> create table t_deferred_parallel_dml
2 segment creation immediate
3 as
4 select * from t_source where 1 = 2;

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> -- One segment
SQL> select count(*) from user_segments where segment_name = 'T_DEFERRED_PARALLEL_DML';
1

Elapsed: 00:00:00.01
SQL>
SQL> explain plan for
2 insert /*+ append parallel(t) */ into t_deferred_parallel_dml t select * from t_source;

Explained.

Elapsed: 00:00:00.01
SQL>
SQL> -- This plan does show parallel DML
SQL> select * from table(dbms_xplan.display);
Plan hash value: 4111574378

------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 73251 | 6938K| 117 (1)| 00:00:02 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10000 | 73251 | 6938K| 117 (1)| 00:00:02 | Q1,00 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT | T_DEFERRED_PARALLEL_DML | | | | | Q1,00 | PCWP | |
| 4 | PX BLOCK ITERATOR | | 73251 | 6938K| 117 (1)| 00:00:02 | Q1,00 | PCWC | |
| 5 | TABLE ACCESS FULL| T_SOURCE | 73251 | 6938K| 117 (1)| 00:00:02 | Q1,00 | PCWP | |
------------------------------------------------------------------------------------------------------------------------------

12 rows selected.

Elapsed: 00:00:00.06
SQL>
SQL> spool off

In general I would advise disabling deferred segment creation and enabling it only in environments where a clear benefit can be seen and the side effects are known, understood and tested.

This can be accomplished by simply setting "deferred_segment_creation = false" on instance level.
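For example (assuming an spfile is in use):

-- Disable deferred segment creation instance-wide
alter system set deferred_segment_creation = false scope = both;

-- It can still be requested explicitly per table where desired
create table t_explicitly_deferred (id number)
segment creation deferred;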

Cleanup of partitioned objects residing in offlined tablespaces

Consider the following scenario: a large database containing historical data - typically a data warehouse with partitioned tables whose partitions reside in different tablespaces depending on their age - is supposed to be duplicated, but the target environment, typically a lower environment like a UAT copy of Production, doesn't have sufficient space to hold the complete database and in particular not all of the historic partitions. In that case you somehow need to deal with the resulting partial clone.

Of course, in an ideal world the lower environment is supposed to have sufficient space to hold the complete database, but we don't live in an ideal world. Sometimes, depending on the storage technology, you might be able to do some kind of temporary "split mirror" operation that allows you to start up a complete clone of the database and drop all the affected objects / tablespaces before putting the cleaned up / partial database on the final target system, but this is not always applicable, for instance when using ASM-based storage.

The issue in that particular case is that you can do for instance a partial RMAN clone that omits the historical tablespaces, but the resulting database then obviously has some tablespaces that are unavailable and can't be onlined since the datafiles are simply missing.

In case of objects that are completely residing in the offlined tablespaces the solution is simple and straightforward: The objects can be dropped or the affected tablespaces can simply be dropped with the "including contents" clause.
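For example (hypothetical tablespace name):

-- All objects residing completely in the unavailable tablespace go along with it
drop tablespace ts_hist_2005 including contents;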

Things get more complicated however in case of partitioned objects if some of the partitions reside in the offlined tablespaces.

In that case you can't simply drop the offlined tablespaces - you end up, for example, with the error message "ORA-14404: partitioned table contains partitions in a different tablespace", which tells you that objects exist that have only some of their partitions residing in the offlined tablespace, and Oracle obviously won't do the cleanup job for you in that case.

Oracle allows dropping partitions of a table that reside in offlined tablespaces, but there is an important limitation: you can only do so if no index has been defined on the table. Otherwise you get the error message "ORA-14117/ORA-14630: [sub]partition resides in offlined tablespace" when attempting to drop the partition. It's interesting to note that the mere existence of indexes is sufficient - it doesn't matter whether the index resides in an offlined or onlined tablespace, or whether it is unusable or not - as soon as at least one index is defined on the table, the drop partition operation errors out. It looks like a hard coded "if index exists then error" code path. I'm not sure why exactly this restriction is in place, but it has serious consequences if you need to do this sort of cleanup for a large database with hundreds of thousands of partitions.

By the way, it is interesting that Oracle allows dropping segments from offlined, locally managed tablespaces at all - think about the consequence: the local bitmaps that represent the allocated and free space in the tablespace cannot be updated at the time of the drop operation. This is one of the few disadvantages compared to the old-fashioned dictionary managed tablespaces, where an operation like that could potentially have been limited to a pure dictionary operation.

The approach Oracle chooses is straightforward: The segments dropped are converted into "temporary" objects (kind of a special "recyclebin" if you want to say so) and if the offlined tablespace should get onlined again the local bitmaps will be updated according to these temporary objects and finally the temporary objects will be removed from the dictionary. Otherwise if the offlined tablespace gets dropped the corresponding temporary objects consequently can also be removed.

Tanel Poder recently also blogged about this in the context of read-only tablespaces.

So you could now drop all indexes from the affected tables and recreate them afterwards, but this is not a good idea for several reasons:
* If the remaining partitions still represent multiple terabytes it will take quite long to rebuild all indexes
* Things might go wrong and you end up with missing indexes

Since these are partitioned objects, another Oracle best practice might come to the rescue: exchange the affected partitions with an exchange table excluding indexes - this allows you to drop the exchanged partition residing in an offlined tablespace without any error message. Note that this means that after you've exchanged the partition you need to drop the unpartitioned object (which doesn't have indexes) in order to get rid of the affected segment.

Now if only a few objects are affected, or only a single partition needs to be cleaned up per table, then you're basically done. But if you need to clean up multiple partitions and the number of partitions is high then this approach is not feasible, because it is quite an overhead to drop / create a table for each partition that needs to be cleaned up - in particular when talking about hundreds of thousands of partitions.

The approach of creating a table to exchange partitions with adds another complexity to the problem: many large databases make use of the option to set columns unused instead of actually dropping them, since the former is a simple metadata operation in the dictionary (it renames the column and marks it as hidden) that is independent of the data volume and therefore almost instant, whereas the latter requires Oracle to process the whole segment, which can take quite some time in the case of huge segments (and is not a very efficient operation by the way, but that is something for a different post).

Now creating a table that can be exchanged with such an evolved table containing unused columns cannot be accomplished by simply doing a handy CTAS operation - the unused columns will be missing from the copy and therefore the exchange operation will fail with "column mismatch" errors.
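A minimal sketch of that failure (made-up names; the exact error message may vary, typically it is ORA-14097 "column type or size mismatch in ALTER TABLE EXCHANGE PARTITION"):

-- The evolved source table has a column that was set to unused
alter table t_part set unused (old_col);

-- The CTAS copy knows nothing about the unused (hidden) column
create table t_exchange
as
select * from t_part where 1 = 2;

-- ... therefore the exchange fails with a column mismatch error
alter table t_part
exchange partition p_2005
with table t_exchange
excluding indexes without validation;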

So you need to deal with these unused columns somehow the moment you have to create such a table dynamically - and you have to do that if you need to drop more than a single partition per table.

The only other viable option that I'm aware of is to maintain this exchange table as a partitioned table itself permanently - which means creating two exact copies (in terms of dictionary metadata) of every potentially affected table and making sure to apply the same DDL in the same order, which ensures that the exchange operation with those tables will succeed.

You will need two copies, one unpartitioned and one partitioned, if you want to avoid the drop and dynamic create table operation, since Oracle allows exchanging partitions only with unpartitioned tables. So it takes two exchange operations per affected partition - the first with the unpartitioned table, and the second with the partitioned table. Both exchange operations have to be done excluding indexes, and at least the final partitioned table has to be created excluding indexes - note that Oracle allows the exchange operation even if one table is missing a potentially created primary key constraint if you use the "EXCLUDING INDEXES" clause.

Then you can drop the partition that has been exchanged into the second table since the table doesn't have any indexes defined on it.

This allows you to accomplish two objectives:

1. Avoid a costly (and potentially complex in case of unused columns) drop / create table operation per partition
2. Manage the drop / create cycle by simply dropping and adding partitions in the second table, which is far less overhead than a drop / create table operation - important when performance matters

However maintaining two copies for every table that potentially needs to be cleaned up sounds like a huge overhead, and even in Oracle 11.2 with the deferred segment creation at least the partitioned table copy will allocate space since the deferred segment creation option is not (yet) applicable to partitioned objects.

So here a few ideas how to accomplish such a cleanup operation:

1. For each table to process create the required table copies on the fly dynamically once. If you don't have to deal with unused columns then a simple CTAS operation can be used to create those tables. If you need to handle unused columns, a more complex approach is required. You will have to use custom queries against the data dictionary to extract the required information since the official DBMS_METADATA interface as far as I know doesn't expose information about unused columns.

The following query could be used as a starting point to extract the column definition of a table:

-- Get the column list including hidden columns, but ignoring virtual columns added by function-based indexes
-- TODO: What about virtual columns (11g or later)?
select
        decode(t.hidden_column,
          'YES', 'SYS$' || internal_column_id || '$UNUSED',
          t.column_name
        ) ||
        ' ' ||
        lower(data_type) ||
        decode(substr(data_type, 1, 9),
          'TIMESTAMP', null,
          'INTERVAL ', null,
          decode(t.data_type,
            'NUMBER', decode(t.data_precision || t.data_scale,
                        null, null,
                        '(' ||
                        decode(t.data_precision,
                          null, '*',
                          t.data_precision
                        ) ||
                        decode(t.data_scale,
                          null, null,
                          ',' || t.data_scale
                        ) ||
                        ')'
                      ),
            'FLOAT', '(' || t.data_precision || ')',
            'LONG', null,
            'LONG RAW', null,
            'BLOB', null,
            'CLOB', null,
            'NCLOB', null,
            'BFILE', null,
            'CFILE', null,
            'BINARY_FLOAT', null,
            'BINARY_DOUBLE', null,
            'MLSLABEL', null,
            'ROWID', null,
            'UROWID', null,
            'DATE', null,
            '(' ||
            nvl(
              nvl(
                decode(t.char_length,
                  0, to_number(null),
                  t.char_length
                ),
                t.char_col_decl_length
              ),
              t.data_length
            ) ||
            decode(t.character_set_name,
              'CHAR_CS', decode(t.char_used,
                           'C', ' char',
                           'B', ' byte'
                         )
            ) ||
            ')'
          )
        ) ||
        ' ' ||
        decode(nullable,
          'N', 'not null',
          null
        ) ||
        ',' as column_def
from
        all_tab_cols t
where
        t.owner = :owner            -- placeholder: supply the owner of the source table
and     t.table_name = :table_name  -- placeholder: supply the name of the source table
and     t.segment_column_id is not null
order by
        t.internal_column_id;

This will generate SYS$<internal_column_id>$UNUSED column names for all columns set to unused in the source table, which can then subsequently be set to unused in the newly created table to arrive at the same layout as the source table.

Note that it hasn't been tested with (11g) virtual columns yet, but it should deal with most of the available data types and also handle columns using char semantics correctly (important for NCHAR / NVARCHAR based columns and databases running with multi-byte character sets, in particular AL32UTF8). Virtual columns added via function-based indexes are handled by ignoring them - they have no SEGMENT_COLUMN_ID and are therefore filtered out.

Of course the extraction process potentially needs to handle much more than shown here, for instance check the table properties like index organized tables, partition keys etc.
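Putting the generated column list to use could then look like this - just a sketch with made-up names and columns:

-- Create the exchange table from the generated column definitions, including
-- placeholders for the columns set to unused in the source table
create table t_exchange (
  id           number not null
, sys$2$unused varchar2(30)
, created      date
);

-- Set the placeholder column unused so that the hidden column layout
-- matches the evolved source table
alter table t_exchange set unused (sys$2$unused);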

2. Process each affected table partition using the following approach:

* Add a new "clean" partition to the partitioned copy table in a tablespace that is online. The partition key is a "dummy" key that can be the same for each iteration of this process. This is the segment that is going to end up in the affected partitioned table and will be finally dropped since it then resides in a available tablespace. Remember, we want to get rid of those partitions residing in unavailable tablespaces, since we have not sufficient space in the clone of the database to make them available)

* Exchange the affected partition with the unpartitioned copy table EXCLUDING INDEXES (The unpartitioned table copy has also been initially created in an available tablespace)

* Exchange the unpartitioned copy table (that now holds the defect partition) with the newly created "clean" partition of the partitioned copy table EXCLUDING INDEXES

* Now the defect partition resides in the partitioned copy table and can simply be dropped since we don't have any indexes defined on this table

* Finally we can now drop the partition from the affected table (which at this point holds the segment that was added to the partitioned copy table in the previous iteration of the process) since it now resides in an available tablespace

The same process can be applied to a subpartitioned table. There are of course some more details to consider, for example, you can't drop the last subpartition or partition from a partition / table, but since we assume that our affected table in general still has some partitions left over that reside in current and available tablespaces it should be possible to handle these scenarios.

Here is a code snippet / template that applies this process:

/**
 * Drop a (sub-)partition residing in an offline tablespace
 * @param p_object                   The original object to clean up
 * @param p_temp_exc_table_name      The name of the working table
 * @param p_clean_tablespace_name    The tablespace where to create the cleanup partition
 * @param p_partition_name           The name of the (sub-)partition to drop
 * @param p_segment_type             The type of the segment (TABLE [SUB]PARTITION)
 * @param p_last_partition_indicator For subpartitions indicate that this is
 *                                   the last subpartition - this will drop the whole partition
 **/
procedure drop_partition(
    p_object                   in r_oracle_object
  , p_temp_exc_table_name      in out oracle_object
  , p_clean_tablespace_name    in oracle_object
  , p_partition_name           in oracle_object
  , p_segment_type             in oracle_object
  , p_last_partition_indicator in varchar2
)
as
    s_sql                     large_sql;
    s_partition_type          varchar2(255);
    s_exchange_table          oracle_object;
    e_partition_in_offline_ts exception;
    pragma exception_init(e_partition_in_offline_ts, -14117);
    -- We do this in an autonomous transaction to keep it separate
    -- from any outside transactions like the "queue" transaction
    -- used by the parallel clean up threads
    pragma autonomous_transaction;
begin
    -- Determine if this is a partition or subpartition
    s_partition_type := substr(p_segment_type, 7, 255);

    -- Step 1: Add a clean partition to the temporary cleanup table
    s_sql := 'alter table ' || p_temp_exc_table_name || ' add partition p_cleanup values (42) tablespace ' || p_clean_tablespace_name;

    execute(p_object.owner, s_sql);

    -- Get the name of the unpartitioned table copy
    s_exchange_table := get_exchange_table(p_object);

    -- Step 2: Exchange the defect (sub-)partition with the unpartitioned copy table EXCLUDING indexes
    s_sql := '
        alter table ' || p_object.segment_name || '
        exchange ' || s_partition_type || ' ' || p_partition_name || '
        with table ' || s_exchange_table || '
        excluding indexes without validation';

    execute(p_object.owner, s_sql);

    -- Step 3: Exchange the unpartitioned copy table with the partitioned copy table partition again EXCLUDING indexes
    s_sql := '
        alter table ' || p_temp_exc_table_name || '
        exchange partition p_cleanup
        with table ' || s_exchange_table || '
        excluding indexes without validation';

    execute(p_object.owner, s_sql);

    -- Step 4: Drop the defect partition now residing in the partitioned "waste-bin" table partition
    -- Since we don't have any indexes this is possible now
    s_sql := 'alter table ' || p_temp_exc_table_name || ' drop partition p_cleanup';

    begin
        execute(p_object.owner, s_sql);
    exception
        -- Any tables with LOBs will cause an exception since they will implicitly have a LOB INDEX added
        -- Therefore the DROP PARTITION will fail with ORA-14117
        -- The simplest solution is to drop and recreate the entire working table - note the serious side effects
        -- of the underlying ORA-01110 error in 11.1.0.7. See below comments about the potential impact
        -- of the health check monitoring
        -- A smoother way of handling this to avoid these potential side-effects is to check if the
        -- table has any indexes defined on it and then do not attempt to drop the partition but
        -- immediately drop the entire table
        when e_partition_in_offline_ts then
            drop_temporary_exc_table(p_object, p_temp_exc_table_name);
            p_temp_exc_table_name := create_temporary_exc_table(p_object, p_clean_tablespace_name, p_segment_type);
    end;

    -- Step 5: Drop the partition / subpartition which completes the cleanup
    -- TODO: Are really all subpartitions of a partition affected?
    --       If not, this logic needs to be revised
    if (s_partition_type = 'SUBPARTITION' and p_last_partition_indicator = 'Y') then
        s_sql := 'alter table ' || p_object.segment_name || ' drop partition ' || get_parent_partition(p_object, p_partition_name);

        execute(p_object.owner, s_sql);
    elsif s_partition_type = 'PARTITION' then
        s_sql := 'alter table ' || p_object.segment_name || ' drop partition ' || p_partition_name;

        execute(p_object.owner, s_sql);
    end if;
end drop_partition;

Note that tables with LOBs need special treatment since the logic used for tables without LOBs fails with an error that the partition resides in an offline tablespace when the "defect" partition gets dropped from the temporary cleanup table.

This is the error that is typically raised when attempting to drop a partition from a table having indexes defined on it. Very likely this error is caused by the fact that a LOB always has a corresponding LOB INDEX (which will always be exchanged when doing an EXCHANGE PARTITION, even if EXCLUDING INDEXES is specified).

Since this index cannot be dropped from the temporary cleanup table we actually do not gain anything by performing the exchange exercise.
The simplest solution to the problem is to drop and recreate the cleanup table which is possible without the ORA-14117 error message being raised.

Note that attempting to drop the partition and raising the ORA-14117 error has a serious side effect in 11.1.0.7 and 11.2.0.1: if the "Health Monitoring" detects a "flooding" ORA-1110 error (something is wrong with a datafile), any operation on these datafiles from then on seems to have tracing turned on - this leads to the odd situation that onlining such a tablespace again (which we can't do in our actual case since the datafiles are not there, but which I've done for testing purposes) takes minutes and generates trace files between 20 and 500 MB (!) in size.

I haven't investigated much into this, but in my tests this couldn't be avoided by setting the undocumented parameter "_disable_health_checks=TRUE".

The MMON_SLAVE process will dump something similar to this:

========= Dump for error ORA 1110 (no incident) ========
----- DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----

Also for the sessions that cause the error something like this will be written to trace files:

DDE rules only execution for: ORA 1110
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
DDE Action 'DB_STRUCTURE_INTEGRITY_CHECK' was flood controlled
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (FLOOD CONTROLLED, 17 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----

A smoother way of handling this to avoid these potential side-effects is to check if the table has any indexes defined on it and then do not attempt to drop the partition but immediately drop the entire table.

Since each iteration of this process takes some time (in a database with hundreds of thousands of partitions this can take a few seconds per partition due to the underlying recursive data dictionary operations) and is, according to some traces, mostly CPU bound, you can speed up the processing by using multiple processes if you have multiple CPUs available.

Using an approach with Advanced Queueing and multiple consumers (for instance as many consumers as CPUs) we were able to clean up more than 200,000 partitions (table plus index (sub)partitions) in less than one hour, which was within the available timeframe for that operation.
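Just to give an idea of the principle, here is a rough sketch of such a setup - all names are made up and the actual implementation obviously needs more error handling and bookkeeping: one producer enqueues the (sub-)partitions to clean up, and several consumer jobs dequeue and process them concurrently, for example by calling the DROP_PARTITION template shown above:

-- Payload describing one (sub-)partition to clean up
create type t_cleanup_task as object (
  owner          varchar2(30)
, table_name     varchar2(30)
, partition_name varchar2(30)
, segment_type   varchar2(30)
);
/

-- Queue setup (single-consumer queue, each message is processed by exactly one job)
begin
  dbms_aqadm.create_queue_table(
    queue_table        => 'cleanup_qt'
  , queue_payload_type => 'T_CLEANUP_TASK'
  );
  dbms_aqadm.create_queue(queue_name => 'cleanup_q', queue_table => 'cleanup_qt');
  dbms_aqadm.start_queue(queue_name => 'cleanup_q');
end;
/

-- Each of the N consumer jobs (for example one per CPU) runs a loop like this
declare
  deq_opt   dbms_aq.dequeue_options_t;
  msg_prop  dbms_aq.message_properties_t;
  task      t_cleanup_task;
  msgid     raw(16);
  e_timeout exception;
  pragma exception_init(e_timeout, -25228);
begin
  deq_opt.wait := 10;   -- give up once the queue has been empty for 10 seconds
  loop
    dbms_aq.dequeue(
      queue_name         => 'cleanup_q'
    , dequeue_options    => deq_opt
    , message_properties => msg_prop
    , payload            => task
    , msgid              => msgid
    );
    -- here the actual cleanup happens, e.g. a call to the drop_partition template
    commit;
  end loop;
exception
  when e_timeout then
    null;   -- queue drained, this consumer ends
end;
/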

Time Model Bug

Tasks that are performed via jobs in the database will be double accounted in the system time model that has been introduced with Oracle 10g.

So if you execute significant workload via DBMS_JOB or DBMS_SCHEDULER any system time model related statistic like DB Time, DB CPU etc. that gets recorded for that workload gets double accounted.

This bug is not particularly relevant since your top workloads will still be the same top workloads, because all other statistics (like Elapsed Time, CPU, Buffer Gets etc.) are not affected by the bug.

I mention it only here since the bug (see below for details) as of the time of writing can't yet be found on My Oracle Support in the bug database but I recently came across several AWR reports where the majority of workload was generated via job processes and therefore the time model statistics were effectively doubled.

It might help as a viable explanation if you sometimes wonder why an AWR or Statspack report only captures 50% or less of the recorded total DB Time or DB CPU and where this unaccounted time has gone. If a significant part of the workload during the reporting period has been performed by sessions controlled via DBMS_JOB or DBMS_SCHEDULER then probably most of the unaccounted time is actually not unaccounted but the time model statistics are wrong.

So if you have such an otherwise unexplainable unaccounted DB Time / DB CPU etc. you might want to check if significant workload during the reporting period was executed via the job system. Note that I don't say that this is the only possible explanation of such unaccounted time - there might be other reasons like uninstrumented waits, other bugs etc.

Of course all the percentages that are shown in the AWR / ADDM / Statspack reports that refer to "Percentage of DB Time" or "Percentage of DB CPU" will be too small in such cases.

If the majority of workload during the reporting period has been generated by jobs then you can safely assume that the time model statistics have to be divided by 2 (and the percentages have to be doubled). If you have a mixture of jobs and regular foreground sessions then it will be harder to derive the correct time model statistics.

Note that the "Active Session History" (ASH) is not affected by the bug - the ASH reports always were consistent in my tests regarding the DB Time (respectively the number of samples) and CPU time information.

The following simple test case can be used to reproduce the issue at will. Ideally you should have exclusive access to the test system since any other concurrent activity will affect the test results.

You might want to adjust the 1000000000 iterations of the simple PL/SQL loop to your particular CPU - on my test system this takes approx. 46 seconds to complete.

The first version assumes that a PERFSTAT user with an installed STATSPACK is present in the database since STATSPACK doesn't require an additional license. An AWR variant follows below.

alter session set nls_language = american nls_territory = america;

store set .settings replace

set echo on timing on define on

define iter="1000000000"

variable snap1 number

exec :snap1 := statspack.snap

declare
n_cnt binary_integer;
begin
n_cnt := 0;
for i in 1..&iter loop
n_cnt := n_cnt + 1;
end loop;
end;
/

variable snap2 number

exec :snap2 := statspack.snap

/* Uncomment this if you want to test via DBMS_JOB
variable job_id number

begin
dbms_job.submit(:job_id, '
declare
n_cnt binary_integer;
n_status integer;
begin
n_cnt := 0;
for i in 1..&iter loop
n_cnt := n_cnt + 1;
end loop;
n_status := dbms_pipe.send_message(''bg_job_complete'');
end;
');
end;
/

commit;
*/

/* Uncomment this if you want to test via DBMS_SCHEDULER */
begin
dbms_scheduler.create_job(
job_name => dbms_scheduler.generate_job_name
, job_type => 'PLSQL_BLOCK'
, job_action => '
declare
n_cnt binary_integer;
n_status integer;
begin
n_cnt := 0;
for i in 1..&iter loop
n_cnt := n_cnt + 1;
end loop;
n_status := dbms_pipe.send_message(''bg_job_complete'');
end;
' , enabled => true);
end;
/

declare
pipe_status integer;
begin
pipe_status := dbms_pipe.receive_message('bg_job_complete');
end;
/

declare
pipe_id integer;
begin
pipe_id := dbms_pipe.remove_pipe('bg_job_complete');
end;
/

variable snap3 number

exec :snap3 := statspack.snap

rem set heading off pagesize 0 feedback off linesize 500 trimspool on termout off echo off verify off

prompt Enter PERFSTAT password

connect perfstat

column dbid new_value dbid noprint

select dbid from v$database;

column instance_number new_value inst_num noprint

select instance_number from v$instance;

column b_id new_value begin_snap noprint
column e_id new_value end_snap noprint

select :snap1 as b_id, :snap2 as e_id from dual;
define report_name=sp_foreground.txt

@?/rdbms/admin/sprepins

column dbid new_value dbid noprint

select dbid from v$database;

column instance_number new_value inst_num noprint

select instance_number from v$instance;

column b_id new_value begin_snap noprint
column e_id new_value end_snap noprint

select :snap2 as b_id, :snap3 as e_id from dual;
define report_name=sp_background.txt

@?/rdbms/admin/sprepins

undefine iter

@.settings

set termout on

Here is the same test case but with AWR reports (requires additional diagnostic license)

alter session set nls_language = american nls_territory = america;

store set .settings replace

set echo on timing on define on

define iter="1000000000"

column snap1 new_value awr_snap1 noprint

select dbms_workload_repository.create_snapshot as snap1 from dual;

declare
n_cnt binary_integer;
begin
n_cnt := 0;
for i in 1..&iter loop
n_cnt := n_cnt + 1;
end loop;
end;
/

column snap2 new_value awr_snap2 noprint

select dbms_workload_repository.create_snapshot as snap2 from dual;

/* Uncomment this if you want to test via DBMS_JOB
variable job_id number

begin
dbms_job.submit(:job_id, '
declare
n_cnt binary_integer;
n_status integer;
begin
n_cnt := 0;
for i in 1..&iter loop
n_cnt := n_cnt + 1;
end loop;
n_status := dbms_pipe.send_message(''bg_job_complete'');
end;
');
end;
/

commit;
*/

/* Uncomment this if you want to test via DBMS_SCHEDULER */
begin
dbms_scheduler.create_job(
job_name => dbms_scheduler.generate_job_name
, job_type => 'PLSQL_BLOCK'
, job_action => '
declare
n_cnt binary_integer;
n_status integer;
begin
n_cnt := 0;
for i in 1..&iter loop
n_cnt := n_cnt + 1;
end loop;
n_status := dbms_pipe.send_message(''bg_job_complete'');
end;
' , enabled => true);
end;
/

declare
pipe_status integer;
begin
pipe_status := dbms_pipe.receive_message('bg_job_complete');
end;
/

declare
pipe_id integer;
begin
pipe_id := dbms_pipe.remove_pipe('bg_job_complete');
end;
/

column snap3 new_value awr_snap3 noprint

select dbms_workload_repository.create_snapshot as snap3 from dual;

set heading off pagesize 0 feedback off linesize 500 trimspool on termout off echo off verify off

spool awr_foreground.html

select
output
from
table(
sys.dbms_workload_repository.awr_report_html(
(select dbid from v$database)
, (select instance_number from v$instance)
, &awr_snap1
, &awr_snap2
)
);

spool off

spool awr_background.html

select
output
from
table(
sys.dbms_workload_repository.awr_report_html(
(select dbid from v$database)
, (select instance_number from v$instance)
, &awr_snap2
, &awr_snap3
)
);

spool off

spool awr_diff.html

select
output
from
table(
sys.dbms_workload_repository.awr_diff_report_html(
(select dbid from v$database)
, (select instance_number from v$instance)
, &awr_snap1
, &awr_snap2
, (select dbid from v$database)
, (select instance_number from v$instance)
, &awr_snap2
, &awr_snap3
)
);

spool off

undefine awr_snap1
undefine awr_snap2
undefine awr_snap3

undefine iter

column snap1 clear
column snap2 clear
column snap3 clear

@.settings

set termout on

And here is a sample snippet from a generated Statspack report on a single CPU system with nothing else running on the system:

Normal foreground execution:

STATSPACK report for

Database DB Id Instance Inst Num Startup Time Release RAC
~~~~~~~~ ----------- ------------ -------- --------------- ----------- ---
orcl112 1 05-Aug-10 08:21 11.2.0.1.0 NO

Host Name Platform CPUs Cores Sockets Memory (G)
~~~~ ---------------- ---------------------- ----- ----- ------- ------------
XXXX Microsoft Windows IA ( 1 0 0 2.0

Snapshot Snap Id Snap Time Sessions Curs/Sess Comment
~~~~~~~~ ---------- ------------------ -------- --------- ------------------
Begin Snap: 13 05-Aug-10 08:34:17 25 1.2
End Snap: 14 05-Aug-10 08:35:05 25 1.2
Elapsed: 0.80 (mins) Av Act Sess: 1.1
DB time: 0.87 (mins) DB CPU: 0.80 (mins)

Cache Sizes Begin End
~~~~~~~~~~~ ---------- ----------
Buffer Cache: 104M Std Block Size: 8K
Shared Pool: 128M Log Buffer: 6,076K

Load Profile Per Second Per Transaction Per Exec Per Call
~~~~~~~~~~~~ ------------------ ----------------- ----------- -----------
DB time(s): 1.1 2.4 0.09 3.99
DB CPU(s): 1.0 2.2 0.08 3.68

Execution via Job/Scheduler:

STATSPACK report for

Database DB Id Instance Inst Num Startup Time Release RAC
~~~~~~~~ ----------- ------------ -------- --------------- ----------- ---
orcl112 1 05-Aug-10 08:21 11.2.0.1.0 NO

Host Name Platform CPUs Cores Sockets Memory (G)
~~~~ ---------------- ---------------------- ----- ----- ------- ------------
XXXX Microsoft Windows IA ( 1 0 0 2.0

Snapshot Snap Id Snap Time Sessions Curs/Sess Comment
~~~~~~~~ ---------- ------------------ -------- --------- ------------------
Begin Snap: 14 05-Aug-10 08:35:05 25 1.2
End Snap: 15 05-Aug-10 08:35:53 24 1.3
Elapsed: 0.80 (mins) Av Act Sess: 1.9
DB time: 1.55 (mins) DB CPU: 1.54 (mins)

Cache Sizes Begin End
~~~~~~~~~~~ ---------- ----------
Buffer Cache: 104M Std Block Size: 8K
Shared Pool: 128M Log Buffer: 6,076K

Load Profile Per Second Per Transaction Per Exec Per Call
~~~~~~~~~~~~ ------------------ ----------------- ----------- -----------
DB time(s): 1.9 92.8 0.79 7.74
DB CPU(s): 1.9 92.1 0.78 7.68

As you might have guessed, my single CPU test system did not suddenly get a second CPU when performing the same task via DBMS_SCHEDULER / DBMS_JOB - yet in that case the time model reports (almost) two seconds of DB Time / DB CPU and two average active sessions per second.

I have reproduced the bug on versions 10.2.0.4, 11.1.0.7 and 11.2.0.1 but very likely all versions supporting the time model are affected.

A (non-public) bug "9882245 - DOUBLE ACCOUNTING OF SYS MODEL TIMINGS FOR WORKLOAD RUN THROUGH JOBS" has been filed for it, but the fix is not available yet therefore as far as I know it is not yet part of any available patch set / PSU.

Note that there seems to be a different issue with the DB CPU time model component: if you have a system that reports more CPUs than sockets (for example a Power5, Power6 or Power7 based IBM server that reports 16 sockets / 32 CPUs) then the DB CPU component gets reduced by approximately 50%, which means it is divided by 2.

This means in combination with above bug that you end up with a doubled DB Time component for tasks executed via jobs, but the DB CPU time model component is in the right ballpark since the doubled DB CPU time gets divided by 2.

I don't know if the bug fix also covers this issue, so you might want to keep this in mind when checking any time model based information.

Compression Restrictions

When using table or index compression certain restrictions apply. Since I always find it hard to gather that information from the manuals, and since some of the restrictions are not documented properly or not at all, I'll attempt to summarize the restrictions that I'm aware of here:

1. Compression attribute on subpartitions
Note that you can't define the COMPRESSION on subpartition level - the subpartitions inherit this attribute from the parent partition.

It is however perfectly valid and legal to exchange a subpartition with a compressed table and that way introduce individual compression on subpartition level. It is unclear why this is only possible using this indirect method - it simply looks like an implementation restriction. Since compressed data is most suitable for data warehouse environments, and these ought to use EXCHANGE [SUB]PARTITION technology anyway, this restriction is probably not that crucial.
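A sketch of what that could look like (made-up names, list subpartitioning by region):

-- Composite range-list partitioned target table without compression
create table t_target (
  id      number
, region  varchar2(10)
, created date
)
partition by range (created)
subpartition by list (region)
subpartition template (
  subpartition sp_north values ('NORTH')
, subpartition sp_south values ('SOUTH')
)
(
  partition p_2010 values less than (date '2011-01-01')
);

-- The exchange table carries the desired compression attribute
create table t_exchange compress
as
select * from t_target where 1 = 2;

insert /*+ append */ into t_exchange (id, region, created)
select rownum, 'NORTH', date '2010-06-01' from dual connect by level <= 10000;

commit;

-- Swap the compressed segment into the corresponding subpartition
alter table t_target
exchange subpartition p_2010_sp_north
with table t_exchange
including indexes without validation;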

2. Number of columns
Basic heap table compression and the new OLTP (in 11.1 called "for all operations") compression are both limited to 255 columns. If you attempt to compress a table with more than 255 columns then no compression will be performed, although you don't get an error message. Interestingly the COMPRESSION column in the dictionary still shows "ENABLED", which is certainly misleading.
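The following sketch (made-up names; the check via DBMS_COMPRESSION requires 11.2) can be used to see this for yourself - the dictionary reports the table as compression-enabled while the rows loaded via direct path remain uncompressed:

-- Build a basic compressed table with 300 columns
declare
  s_sql varchar2(32767);
begin
  s_sql := 'create table t_many_cols (c1 number';
  for i in 2..300 loop
    s_sql := s_sql || ', c' || i || ' number';
  end loop;
  s_sql := s_sql || ') compress';
  execute immediate s_sql;
end;
/

insert /*+ append */ into t_many_cols (c1, c300)
select rownum, rownum from dual connect by level <= 100000;

commit;

-- Shows ENABLED ...
select compression from user_tables where table_name = 'T_MANY_COLS';

-- ... but the rows are reported as uncompressed (COMP_NOCOMPRESS = 1)
select distinct dbms_compression.get_compression_type(user, 'T_MANY_COLS', rowid)
from t_many_cols where rownum <= 1000;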

3. DDL Restrictions - modification of table structure
As soon as a segment contains compressed data or has been marked for compression (does not apply in all cases, see below for more information), the following restrictions regarding the modification of the table structure apply:

a. Basic heap table compression will not prevent the addition of more than 255 columns via ADD COLUMN - but if you update the segment the affected blocks will be silently decompressed during the update operation. In 9.2 no schema evolution is possible with compressed segments, see the next paragraph.

b. In Oracle 9.2 a lot of modifications to the table are no longer possible with compression enabled (and this applies to partitioned tables as well if at least one partition is compressed):

- You cannot add or drop any columns, and you can't even set any column unused. The error messages are, to say the least, confusing, since they tell you something about virtual and object columns, but in fact it is the enabled compression that causes the failure. These are serious limitations in terms of schema evolution. I'll address below what can be done to overcome this issue.

Oracle9i Enterprise Edition Release 9.2.0.8.0 - Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.8.0 - Production

SQL> create table t_part
2 compress
3 partition by list (user_id) (
4 partition p_default values (default)
5 )
6 as
7 select * from all_users;

Table created.

SQL> alter table t_part set unused (created);
alter table t_part set unused (created)
*
ERROR at line 1:
ORA-12996: cannot drop system-generated virtual column

SQL> alter table t_part drop unused columns;

Table altered.

SQL> alter table t_part drop column username;
alter table t_part drop column username
*
ERROR at line 1:
ORA-12996: cannot drop system-generated virtual column

SQL> alter table t_part add test_col varchar2(10);
alter table t_part add test_col varchar2(10)
*
ERROR at line 1:
ORA-22856: cannot add columns to object tables

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-22856: cannot add columns to object tables

SQL>

c. In Oracle 10.2 some restrictions have been lifted but there are still a couple of limitations:

- You can now add columns and you can set columns unused but you can't drop columns nor can you add columns with default values

This allows schema evolution but has interesting side effects if you attempt to use for example a Create Table As Select (CTAS) statement to dynamically create a copy of a table and attempt to perform an EXCHANGE PARTITION operation with that newly created table: If there are unused columns in the source table then these won't be copied over to the table created via CTAS and therefore the EXCHANGE PARTITION operation will fail with column mismatch error messages. Since you can't drop the unused columns as long as there is compression enabled on the table (more on that later since it is a bit more complex with partitioned tables), you effectively can't use that approach. In order to perform EXCHANGE PARTITION with compressed tables and evolved schemas that include unused columns you have basically three choices:

- Maintain the "EXCHANGE" table along with the target table which means apply the same DDL in the same order to both tables in order to keep them synchronized

- Extract the hidden columns from the dictionary and create an exchange table with the correct hidden/unused columns, which can turn into a complex operation if you need to cover all possible column data types and features including LOBs etc.

- Perform some kind of rebuild operation on the target table to get rid of the unused columns. Some ideas regarding this point are going to follow further below

Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table t_part
2 compress
3 partition by list (user_id) (
4 partition p_default values (default)
5 )
6 as
7 select * from all_users;

Table created.

SQL> alter table t_part set unused (created);

Table altered.

SQL> alter table t_part drop unused columns;
alter table t_part drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop column username;
alter table t_part drop column username
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part add test_col varchar2(10);

Table altered.

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL>

d. Oracle 11 introduces Advanced Compression which is COMPRESS FOR ALL OPERATIONS in 11.1 or OLTP compression in 11.2 (don't confuse this with HCC compression levels)

Basic compression in 11.1 has the same limitations as 10.2, but with advanced compression (called COMPRESS FOR ALL OPERATIONS in 11.1) all of the mentioned schema evolution restrictions have been lifted, as can be seen below.

Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table t_part
2 --compress
3 compress for all operations
4 partition by list (user_id) (
5 partition p_default values (default)
6 )
7 as
8 select * from all_users;

Table created.

SQL> alter table t_part set unused (created);

Table altered.

SQL> alter table t_part drop unused columns;

Table altered.

SQL> alter table t_part drop column username;

Table altered.

SQL> alter table t_part add test_col varchar2(10);

Table altered.

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL>

This is more of a theoretical nature since you usually won't mix basic with advanced compression, but things become more interesting when mixing basic and advanced compression in different partitions of the same table:

Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table t_part
2 compress
3 --compress for all operations
4 partition by list (user_id) (
5 partition p_default values (default)
6 )
7 as
8 select * from all_users;

Table created.

SQL> alter table t_part set unused (created);

Table altered.

SQL> alter table t_part drop unused columns;
alter table t_part drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop column username;
alter table t_part drop column username
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part add test_col varchar2(10);

Table altered.

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part split partition p_default values (123)
2 into (partition p_123 compress for all operations, partition p_default);

Table altered.

SQL> select
2 partition_name
3 , compression
4 , compress_for
5 from
6 user_tab_partitions
7 where
8 table_name = 'T_PART';

PARTITION_NAME COMPRESS COMPRESS_FOR
------------------------------ -------- ------------------
P_DEFAULT ENABLED DIRECT LOAD ONLY
P_123 ENABLED FOR ALL OPERATIONS

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL> alter table t_part drop unused columns;
alter table t_part drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop column username;
alter table t_part drop column username
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL>

Notice the inconsistent behaviour: The first DDL command after having added a partition with advanced compression succeeds, but the following DDLs fail with an error message.

If the order of these DDLs is changed, then it becomes obvious that always the first DDL is successful no matter which of those failing otherwise comes first.

Furthermore, in 11.1.0.7, if you have a mixture of Advanced Compression (COMPRESS FOR ALL OPERATIONS) and NOCOMPRESS partitions the above restrictions still apply - only if you have set all partitions to COMPRESS FOR ALL OPERATIONS are the restrictions lifted, which looks more like a bug than a feature.

I'll go into more details below which settings influence this behaviour.

e. Oracle 11.2 introduces Hybrid Columnar Compression (HCC), which is only enabled in conjunction with Exadata

In 11.2 basic compression still has the same restrictions as already mentioned above, and if at least one partition of a partitioned table has advanced compression enabled then some of the restrictions are lifted - even with a mixture of basic and advanced compression or a mixture of no compression and advanced compression. However, adding a column with a default value is still only supported if all partitions are compressed with Advanced Compression. I'm not sure this is really intended, since at least when mixing Advanced Compression with no compression, adding a column with a default value should arguably be possible.

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table t_part
2 compress
3 --compress for all operations
4 partition by list (user_id) (
5 partition p_default values (default)
6 )
7 as
8 select * from all_users;

Table created.

SQL> alter table t_part set unused (created);

Table altered.

SQL> alter table t_part drop unused columns;
alter table t_part drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop column username;
alter table t_part drop column username
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part add test_col varchar2(10);

Table altered.

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part split partition p_default values (123)
2 into (partition p_123 compress for all operations, partition p_default);

Table altered.

SQL> select
2 partition_name
3 , compression
4 , compress_for
5 from
6 user_tab_partitions
7 where
8 table_name = 'T_PART';

PARTITION_NAME COMPRESS COMPRESS_FOR
------------------------------ -------- ------------
P_123 ENABLED OLTP
P_DEFAULT ENABLED BASIC

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop unused columns;

Table altered.

SQL> alter table t_part drop column username;

Table altered.

SQL> alter table t_part move partition p_default compress for all operations;

Table altered.

SQL> select
2 partition_name
3 , compression
4 , compress_for
5 from
6 user_tab_partitions
7 where
8 table_name = 'T_PART';

PARTITION_NAME COMPRESS COMPRESS_FOR
------------------------------ -------- ------------
P_123 ENABLED OLTP
P_DEFAULT ENABLED OLTP

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL>

When using HCC the most important difference is that it is not restricted to 255 columns - it successfully compresses segments with any number of columns (up to the hard limit of 1,000 columns).

HCC can only be used with direct-path operations; any conventional DML decompresses the affected blocks and falls back to OLTP compression instead - which means that if the segment has more than 255 columns the compression will effectively be disabled, since non-HCC compression only supports up to 255 columns.

HCC has only a few restrictions regarding schema evolution - the most noticeable one is that DROP COLUMN is always turned into a SET COLUMN UNUSED, so with HCC enabled you can't actually drop columns, since DROP UNUSED COLUMNS is not supported and raises an error. This means that with HCC enabled every dropped / added column counts towards the hard limit of 1,000 columns.

Again this has the same side effect as described above - a simple CTAS operation to create a table for a partition exchange no longer works once unused columns are present.
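To illustrate the point, here is a minimal sketch of the exchange problem (table and partition names are just examples following the tests above; the exchange is expected to fail with a column mismatch error because the CTAS copy does not carry the unused column):

-- T_PART is assumed to have an unused column (from SET UNUSED) that cannot
-- be dropped while the compression is enabled
create table t_exchange
as
select * from t_part where 1 = 2;

-- The CTAS copy has no unused column, so the internal column counts differ
-- and the exchange attempt fails with a column mismatch error
alter table t_part exchange partition p_123 with table t_exchange;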

SQL> create table t_part
2 compress for query high
3 --compress for all operations
4 partition by list (user_id) (
5 partition p_default values (default)
6 )
7 as
8 select * from all_users;

Table created.

SQL> alter table t_part set unused (created);

Table altered.

SQL> alter table t_part drop unused columns;
alter table t_part drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop column username;

Table altered.

SQL> alter table t_part add test_col varchar2(10);

Table altered.

SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL> alter table t_part split partition p_default values (123)
2 into (partition p_123 compress for all operations, partition p_default);

Table altered.

SQL> select
2 partition_name
3 , compression
4 , compress_for
5 from
6 user_tab_partitions
7 where
8 table_name = 'T_PART';

PARTITION_NAME COMPRESS COMPRESS_FOR
------------------------------ -------- ------------
P_123 ENABLED OLTP
P_DEFAULT ENABLED QUERY HIGH

SQL> alter table t_part add test_col3 varchar2(10) default 'BLA';

Table altered.

SQL> alter table t_part drop unused columns;
alter table t_part drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table t_part drop column username;
alter table t_part drop column username
*
ERROR at line 1:
ORA-00904: "USERNAME": invalid identifier

SQL> alter table t_part move partition p_default compress for all operations;

Table altered.

SQL> select
2 partition_name
3 , compression
4 , compress_for
5 from
6 user_tab_partitions
7 where
8 table_name = 'T_PART';

PARTITION_NAME COMPRESS COMPRESS_FOR
------------------------------ -------- ------------
P_123 ENABLED OLTP
P_DEFAULT ENABLED OLTP

SQL> alter table t_part add test_col4 varchar2(10) default 'BLA';

Table altered.

SQL>

And here is what happens if you're already hitting the 1,000 columns limit and attempt to use DROP [UNUSED] COLUMNS with HCC enabled:

SQL> alter table many_x_cols drop (col256);

Table altered.

SQL> alter table many_x_cols drop unused columns;
alter table many_x_cols drop unused columns
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> alter table many_x_cols add (col256 varchar2(10));
alter table many_x_cols add (col256 varchar2(10))
*
ERROR at line 1:
ORA-01792: maximum number of columns in a table or view is 1000

SQL> alter table many_x_cols move pctfree 0 nocompress;

Table altered.

SQL> alter table many_x_cols drop unused columns;

Table altered.

SQL> alter table many_x_cols add (col256 varchar2(10));

Table altered.

SQL>

As you can see, dropping columns is only supported with HCC disabled, which in this particular case allows another column to be added after the hidden / unused one has been dropped.

- Index restrictions

* You cannot compress a table or a partition of a table (or exchange a compressed partition into it) if there are usable bitmap indexes defined on the table. This is due to the Hakan factor, which is reset when compression is enabled - with compressed segments more rows fit into a single block, leading to a different Hakan factor being used for the bitmap indexes. The bitmap indexes need to be set unusable (all partitions in case of a partitioned table/index) or dropped before enabling the compression, and rebuilt / recreated after compressing the heap segment. This is also officially documented. It is however interesting to note that this restriction is only enforced when switching from uncompressed to compressed segments / partitions. Changing the compression, for example from basic compression to HCC, is possible without hitting this restriction, but due to the different compression levels this in my opinion potentially leads again to significantly different Hakan factors - so actively resetting the Hakan factor using "ALTER TABLE ... MINIMIZE RECORDS_PER_BLOCK" and rebuilding the bitmap indexes might be in order when moving between different compression levels.

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter table t_part move partition p_default nocompress;

Table altered.

SQL> alter table t_part move partition p_123 nocompress;

Table altered.

SQL> alter table t_part modify default attributes nocompress;

Table altered.

SQL> alter table t_part nocompress;

Table altered.

SQL> create bitmap index t_part_idx_bitmap on t_part (test_col) local;

Index created.

SQL> alter table t_part move partition p_default compress basic;
alter table t_part move partition p_default compress basic
*
ERROR at line 1:
ORA-14646: Specified alter table operation involving compression cannot be
performed in the presence of usable bitmap indexes

SQL> alter table t_part minimize records_per_block;
alter table t_part minimize records_per_block
*
ERROR at line 1:
ORA-28602: statement not permitted on tables containing bitmap indexes

SQL> alter index t_part_idx_bitmap modify partition p_default unusable;

Index altered.

SQL> alter table t_part move partition p_default compress basic;
alter table t_part move partition p_default compress basic
*
ERROR at line 1:
ORA-14646: Specified alter table operation involving compression cannot be
performed in the presence of usable bitmap indexes

SQL> alter table t_part minimize records_per_block;
alter table t_part minimize records_per_block
*
ERROR at line 1:
ORA-28602: statement not permitted on tables containing bitmap indexes

SQL> alter index t_part_idx_bitmap unusable;

Index altered.

SQL> alter table t_part move partition p_default compress basic;

Table altered.

SQL> alter table t_part minimize records_per_block;

Table altered.

SQL> alter index t_part_idx_bitmap rebuild partition p_default;

Index altered.

SQL> alter index t_part_idx_bitmap rebuild partition p_123;

Index altered.

SQL> alter table t_part move partition p_default nocompress update indexes;

Table altered.

As you can see, switching on compression is only allowed once all partitions of the bitmap index(es) are unusable - the opposite is not true: you can uncompress afterwards without any restrictions.

* This one is related to B*Tree index compression and again of a more theoretical nature, but it might become relevant when trying to introduce B*Tree index compression partition-wise via EXCHANGE PARTITION: you can't compress a single partition of a partitioned B*Tree index (or exchange a partition with a compressed index into the table) if the whole index has not initially been created with compression enabled. If you drop the index and recreate it with compression enabled then you can turn index compression on and off individually per index partition.

Very likely this is related to the restriction that the B*Tree index compression prefix length needs to be the same across all index partitions - it can only be defined at the global index level, and I assume this is why the index needs to be created compressed first. Since you usually want all partitions of a partitioned index compressed the same way this is probably a rather irrelevant restriction (except for the mentioned case regarding EXCHANGE PARTITION).

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create index t_part_idx on t_part (user_id, test_col) nocompress local;

Index created.

SQL> alter index t_part_idx rebuild partition p_default compress;
alter index t_part_idx rebuild partition p_default compress
*
ERROR at line 1:
ORA-28659: COMPRESS must be specified at object level first

SQL> select
2 partition_name
3 , compression
4 from
5 user_ind_partitions
6 where
7 index_name = 'T_PART_IDX';

PARTITION_NAME COMPRESS
------------------------------ --------
P_123 DISABLED
P_DEFAULT DISABLED

SQL> drop index t_part_idx;

Index dropped.

SQL> create index t_part_idx on t_part (user_id, test_col) compress local;

Index created.

SQL> alter index t_part_idx rebuild partition p_default nocompress;

Index altered.

SQL> select
2 partition_name
3 , compression
4 from
5 user_ind_partitions
6 where
7 index_name = 'T_PART_IDX';

PARTITION_NAME COMPRESS
------------------------------ --------
P_123 ENABLED
P_DEFAULT DISABLED

SQL> drop index t_part_idx;

Index dropped.

SQL> create index t_part_idx on t_part (user_id, test_col) nocompress global
2 partition by range(user_id) (
3 partition p_default values less than (maxvalue)
4 );

Index created.

SQL> alter index t_part_idx rebuild partition p_default compress;
alter index t_part_idx rebuild partition p_default compress
*
ERROR at line 1:
ORA-28659: COMPRESS must be specified at object level first

SQL> drop index t_part_idx;

Index dropped.

SQL> create index t_part_idx on t_part (user_id, test_col) compress global
2 partition by range(user_id) (
3 partition p_default values less than (maxvalue)
4 );

Index created.

SQL> alter index t_part_idx rebuild partition p_default nocompress;

Index altered.

SQL> select
2 partition_name
3 , compression
4 from
5 user_ind_partitions
6 where
7 index_name = 'T_PART_IDX';

PARTITION_NAME COMPRESS
------------------------------ --------
P_DEFAULT DISABLED

SQL>

- Overcoming any of the schema evolution restrictions by temporarily uncompressing / disabling enabled compression

As already mentioned above there are different options for overcoming the schema evolution restrictions. One obvious option, although very likely not applicable in many cases simply due to space and load constraints, is to temporarily uncompress the segments and disable the compression. It is interesting to note that undoing compression on partitioned tables is not as simple as it seems, as I will demonstrate - I think I've even read somewhere that it is not possible at all to revert a table into a state where the mentioned restrictions no longer apply. This is definitely not true.

First of all it is important to note that, in addition to the obvious decompression of each partition that contains compressed data, the "COMPRESS" attribute that determines whether a suitable operation will generate compressed data needs to be reset to NOCOMPRESS (but see below for "oddities"). So there is a significant difference between "ALTER ... [NO]COMPRESS" and "ALTER ... MOVE [NO]COMPRESS": the latter reorganizes the segment / partition and compresses / uncompresses the data while doing so, whereas the former just "marks" the segment / partition as "enabled/disabled for compression" but doesn't touch the data and leaves it in whatever state it currently is. As a side note, the "COMPRESSION" column in the dictionary cannot be used to distinguish between these two - the column can show "DISABLED" with data still compressed, it can show "ENABLED" with no compressed data, and you can end up with a mixture of both, since not every block of a segment is necessarily in a compressed state.

Furthermore the "COMPRESSION" attribute can be defined on multiple levels with partitioned tables:

- On the global TABLE level: ALTER TABLE ... [NO]COMPRESS

- On the partition level as default attributes: ALTER TABLE ... MODIFY DEFAULT ATTRIBUTES [NO]COMPRESS

- On the partition level to mark for [NO]COMPRESSION: ALTER TABLE ... MODIFY PARTITION [NO]COMPRESS

- On the partition level with MOVE: ALTER TABLE ... MOVE PARTITION [NO]COMPRESS ... [UPDATE [GLOBAL] INDEXES]

Now in order to overcome the schema evolution restrictions all levels except the partition default attribute level need to be addressed: any compressed data needs to be uncompressed, any partition that has been marked for compression (but doesn't necessarily contain compressed data) needs to be modified as well, and finally the global table level needs to be reset.

The setting at global table level seems to be the crucial step. It looks like Oracle checks during this operation (ALTER TABLE ... NOCOMPRESS) whether all subordinate (sub-)partitions have a flag set that marks them as uncompressed. Only if that is the case can some internal flag at global table level also be reset so that the restrictions no longer apply.

There are however some oddities that need to be considered when doing so:

- You need to "move" (ALTER TABLE ... MOVE PARTITION NOCOMPRESS) any partition that has been marked for compression if any other partition actually held compressed data - simply "unmarking" it using "ALTER TABLE ... MODIFY PARTITION NOCOMPRESS" is not sufficient, even if it does not hold any compressed data (for example, it has only been marked with "ALTER TABLE ... MODIFY PARTITION COMPRESS").

- If you have BITMAP INDEXES defined on a table with compressed partitions you need to either set the whole indexes unusable or drop them before the compression can be effectively disabled at global table level using ALTER TABLE ... NOCOMPRESS
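Putting these pieces together, a minimal sequence to fully reset compression on the example table used here might look like the following sketch (partition names as in the examples; any bitmap indexes would have to be set unusable or dropped first):

-- Reorganize every partition that holds compressed data or has been marked for compression
alter table t_part move partition p_default nocompress;
alter table t_part move partition p_123 nocompress;

-- Resetting the partition default attributes is not strictly required according
-- to the tests, but it keeps the dictionary settings consistent
alter table t_part modify default attributes nocompress;

-- Finally reset the global table level - the crucial step
alter table t_part nocompress;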

See the following test case for all steps to unwind this:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table t_part
2 nocompress
3 partition by list (user_id) (
4 partition p_default values (default)
5 )
6 as
7 select * from all_users;

Table created.

SQL> alter table t_part split partition p_default values (123)
2 into (partition p_123, partition p_default);

Table altered.

SQL> -- Compress one partition
SQL> alter table t_part move partition p_default compress;

Table altered.

SQL> -- And uncompress it again
SQL> alter table t_part move partition p_default nocompress;

Table altered.

SQL> -- Doesn't work!
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> -- Compress one partition
SQL> alter table t_part move partition p_default compress;

Table altered.

SQL> -- And uncompress it again
SQL> alter table t_part move partition p_default nocompress;

Table altered.

SQL> -- The global table level was missing!
SQL> alter table t_part nocompress;

Table altered.

SQL> -- Now it works
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL> -- Start again
SQL> alter table t_part drop column test_col2;

Table altered.

SQL> -- Compress one partition
SQL> alter table t_part move partition p_default compress;

Table altered.

SQL> -- Mark in addition another partition for compression (note: no data gets compressed!)
SQL> alter table t_part modify partition p_123 compress;

Table altered.

SQL> -- And uncompress the compressed partition again
SQL> alter table t_part move partition p_default nocompress;

Table altered.

SQL> -- Unmark the other partition marked for compression (note: no data was compressed!)
SQL> alter table t_part modify partition p_123 nocompress;

Table altered.

SQL> -- The global table level
SQL> alter table t_part nocompress;

Table altered.

SQL> -- But: Doesn't work!
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> -- Oddity 1: Although it doesn't contain compressed data
SQL> -- this partition needs to be reorganized as well
SQL> alter table t_part move partition p_123 nocompress;

Table altered.

SQL> -- Don't forget the global table level, otherwise below doesn't work
SQL> alter table t_part nocompress;

Table altered.

SQL> -- Now it works
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL> -- Start again
SQL> alter table t_part drop column test_col2;

Table altered.

SQL> -- Compress one partition
SQL> alter table t_part move partition p_default compress;

Table altered.

SQL> -- Add a bitmap index
SQL> create bitmap index t_part_idx_bitmap on t_part (username) local;

Index created.

SQL> -- And uncompress the compressed partition again
SQL> alter table t_part move partition p_default nocompress update indexes;

Table altered.

SQL> -- The global table level
SQL> alter table t_part nocompress;

Table altered.

SQL> -- But: Doesn't work!
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> -- Oddity 2: Bitmap indexes on compressed tables need to be set unusable or dropped
SQL> -- Note it is not sufficient to set single partitions unusable
SQL> alter index t_part_idx_bitmap unusable;

Index altered.

SQL> -- But still: Doesn't work because the global level has not been touched again
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';
alter table t_part add test_col2 varchar2(10) default 'BLA'
*
ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

SQL> -- The global table level
SQL> alter table t_part nocompress;

Table altered.

SQL> -- Now it works
SQL> alter table t_part add test_col2 varchar2(10) default 'BLA';

Table altered.

SQL>

DBMS_STATS - Gather table statistics with many columns

Here is another good reason why you probably don't want to use tables with too many columns.

The first good reason is that Oracle stores rows of conventional heap tables with more than 255 columns (where at least one column value after the 255th is non-null) in multiple row pieces, even when the entire row would fit into a single block (and up to 11.2 basic and OLTP heap table compression is not supported on tables with more than 255 columns). This leads to something that is sometimes called "intra-row chaining", which means that Oracle needs to follow the row piece pointer to access columns after the 255th one, leading to multiple logical I/Os per row - up to four for a row approaching the hard limit of 1,000 columns.

Of course such a multi-piece row can easily result in actual row chaining where the row pieces are scattered across different blocks, potentially leading to additional random single-block I/O, but I digress...

This simple example allows you to see this (and the other, yet to be described) effect in action - by default it creates a table with 1,000 columns and 10,000 rows where each row resides in a separate block (so only intra-row chaining should occur). I used an 8K block size tablespace with Manual Segment Space Management (MSSM) in order to avoid the ASSM overhead in this case:

----------------------------------------------------------------------------------------------------------
-- MANY_X_COLS.SQL
-- Usage: @MANY_X_COLS [NUM_COLS] [PCTFREE] [PCTUSED] [TABLESPACE] [COL1] [COL2] [COL3] [COL4] [NUM_ROWS]
-- Defaults: [1000 ] [99 ] [1 ] [TEST_8K ] [001 ] [256 ] [512 ] [768 ] [10000 ]
--
-- Create a table MANY_X_COLS with a configurable number of columns
-- and populate four columns of it using skewed data (normal distribution)
----------------------------------------------------------------------------------------------------------

set termout off verify off
REM Save current settings
store set .settings replace

REM Command line handling

REM Preparation for default value handling
column 1 new_value 1
column 2 new_value 2
column 3 new_value 3
column 4 new_value 4
column 5 new_value 5
column 6 new_value 6
column 7 new_value 7
column 8 new_value 8
column 9 new_value 9

select
'' as "1"
, '' as "2"
, '' as "3"
, '' as "4"
, '' as "5"
, '' as "6"
, '' as "7"
, '' as "8"
, '' as "9"
from
dual
where
rownum = 0;

REM Default values
select
nvl('&&1', '1000') as "1"
, nvl('&&2', '99') as "2"
, nvl('&&3', '1') as "3"
, nvl('&&4', 'TEST_8K') as "4"
, nvl('&&5', '001') as "5"
, nvl('&&6', '256') as "6"
, nvl('&&7', '512') as "7"
, nvl('&&8', '768') as "8"
, nvl('&&9', '10000') as "9"
from
dual;

define n_iter = &1
define n_pctfree = &2
define n_pctused = &3
define s_tblspace = &4
define s_col1 = &5
define s_col2 = &6
define s_col3 = &7
define s_col4 = &8
define n_num_rows = &9

set termout on echo on verify on

declare
s_sql1 varchar2(32767);
s_sql varchar2(32767);

e_table_or_view_not_exists exception;
pragma exception_init(e_table_or_view_not_exists, -942);
begin
for i in 1..&n_iter loop
s_sql1 := s_sql1 || 'col' || to_char(i, 'FM9000') || ' varchar2(10),';
end loop;

s_sql1 := rtrim(s_sql1, ',');

s_sql := 'drop table many_x_cols purge';

begin
execute immediate s_sql;
exception
when e_table_or_view_not_exists then
null;
end;

s_sql := 'create table many_x_cols (' || s_sql1 || ') pctfree &n_pctfree pctused &n_pctused tablespace &s_tblspace';

execute immediate s_sql;

dbms_random.seed(0);

s_sql := 'insert /*+ append */ into many_x_cols (
col&s_col1
, col&s_col2
, col&s_col3
, col&s_col4
)
select
trunc(sys.dbms_random.normal * &n_num_rows.) as col&s_col1
, trunc(sys.dbms_random.normal * &n_num_rows.) as col&s_col2
, trunc(sys.dbms_random.normal * &n_num_rows.) as col&s_col3
, trunc(sys.dbms_random.normal * &n_num_rows.) as col&s_col4
from
dual
connect by
level <= &n_num_rows
';

execute immediate s_sql;

commit;
end;
/

set termout off
REM Restore and cleanup
undefine 1 2 3 4 5 6 7 8 9
column 1 clear
column 2 clear
column 3 clear
column 4 clear
column 5 clear
column 6 clear
column 7 clear
column 8 clear
column 9 clear
@.settings
set termout on

If you have created the table with the defaults and now run some simple queries on it, for example:

SET AUTOTRACE TRACEONLY STATISTICS

SELECT COUNT(*) FROM MANY_X_COLS;

you should see approximately 10,000 consistent gets after some initial runs to avoid the overhead of hard parsing. Your output should look similar to the following:

SQL> SELECT COUNT(*) FROM MANY_X_COLS;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
10004 consistent gets
10000 physical reads
0 redo size
422 bytes sent via SQL*Net to client
415 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

However if you start to access columns after the 255th one (and you've used the defaults or suitable values to populate at least one of these columns), then you'll notice that the number of consistent gets increases whereas the number of physical reads stays more or less the same:

SET AUTOTRACE TRACEONLY STATISTICS

SELECT COUNT(COL256) FROM MANY_X_COLS;

SQL> SELECT COUNT(COL256) FROM MANY_X_COLS;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
30004 consistent gets
10000 physical reads
0 redo size
427 bytes sent via SQL*Net to client
415 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

It is interesting to note that in 11.1.0.7 and 11.2.0.1 the consistent gets sometimes jumped directly from 10,000 to 30,000 and then increased to 40,000 and 50,000 respectively when accessing later columns, whereas in 10.2.0.4 the expected results of 20,000, 30,000 and 40,000 were seen.

It is also interesting to note that the statistic "table fetch continued row" only increases once per row, which could be another interesting variation on the theme "How does Oracle record this statistic".
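If you want to watch this statistic yourself, a simple sketch is to query the session statistics of your own session before and after running the COUNT(COL256) example:

select
sn.name
, ms.value
from
v$mystat ms
, v$statname sn
where
ms.statistic# = sn.statistic#
and sn.name = 'table fetch continued row';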

OK, back to the main topic for today: DBMS_STATS and such tables with many columns.

The other good reason is that DBMS_STATS might switch to a multi-pass mode if a table exceeds a certain number of columns. This means that even if you request only basic column statistics (METHOD_OPT => "FOR ... COLUMNS SIZE 1 ...") without any additional histograms (which require additional passes anyway), DBMS_STATS.GATHER_TABLE_STATS will perform multiple passes (usually full segment scans) on the same segment. This is even true if basic column statistics are only requested for a limited set of columns (for example "FOR COLUMNS SIZE 1 COL1, COL2, COL3"), since Oracle always gathers some column statistics no matter what options are passed to DBMS_STATS (very likely in order to calculate, for example, the average row length for the table statistics).

The threshold at which DBMS_STATS switches to multiple passes depends on the options passed to DBMS_STATS and differs between the 11g versions (11.1 and 11.2) and versions prior to 11g - and I'm not talking about the multiple passes that are used in versions prior to 11g to determine the optimal auto sample size (DBMS_STATS.AUTO_SAMPLE_SIZE, see below for more information).
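If you want to count the passes yourself, one simple approach is to enable extended SQL trace around the DBMS_STATS call and look for the recursive queries against the test table in the resulting trace file - a sketch using the MANY_X_COLS table created by the script above:

alter session set tracefile_identifier = 'dbms_stats_passes';

alter session set events '10046 trace name context forever, level 8';

exec dbms_stats.gather_table_stats(null, 'MANY_X_COLS', method_opt => 'FOR ALL COLUMNS SIZE 1')

alter session set events '10046 trace name context off';

-- Each full scan of MANY_X_COLS in the trace / TKPROF output corresponds to one pass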

In Oracle 10.2.0.4 DBMS_STATS is capable of handling a minimum of approximately 160 columns per pass if basic column statistics are requested and a maximum of approximately 440 columns if no column statistics are requested, which means that it requires between three and seven passes to gather statistics for a 1,000 columns table. Below is what such a typical query looks like - and you'll find up to seven of them, each with a different column list, in the SQL trace file of a DBMS_STATS call for a 1,000 columns table.

Notice that in 10.2.0.4 DBMS_STATS gathers five different values per column for basic column statistics:

- Number of non-null values
- Number of distinct values
- Column size
- Minimum value
- Maximum value

If no column statistics are requested then the "Number of non-null values" and "Column size" per column are still gathered, which explains why DBMS_STATS is capable of handling more columns per pass - simply less information per column is requested.

select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
*/ count(*),count("COL257"),count(distinct "COL257"),
sum(sys_op_opnsize("COL257")),substrb(dump(min("COL257"),16,0,32),1,120),
substrb(dump(max("COL257"),16,0,32),1,120),count("COL258"),count(distinct
"COL258"),sum(sys_op_opnsize("COL258")),substrb(dump(min("COL258"),16,0,32),
1,120),substrb(dump(max("COL258"),16,0,32),1,120),count("COL259"),
count(distinct "COL259"),sum(sys_op_opnsize("COL259")),
substrb(dump(min("COL259"),16,0,32),1,120),substrb(dump(max("COL259"),16,0,
32),1,120),...
.
.
.
from
"CBO_TEST"."MANY_X_COLS" t

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.20 0.19 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 1.21 3.07 9713 40010 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 1.41 3.26 9713 40010 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT GROUP BY (cr=40010 pr=9713 pw=0 time=3070537 us)
10000 TABLE ACCESS FULL MANY_X_COLS (cr=40010 pr=9713 pw=0 time=1820245 us)

Tracing the DBMS_STATS.GATHER_TABLE_STATS call furthermore reveals that Oracle performs at least one additional query when using DBMS_STATS.AUTO_SAMPLE_SIZE (the default), which is probably used for determining the optimal sample size, because prior to 11g Oracle uses an iterative approach: the sample size is increased in steps until the number of distinct values per column passes some statistical checks. The interesting thing however is that these additional queries are repeated per pass:

select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
*/ count(*)
from
"CBO_TEST"."MANY_X_COLS" sample block ( 1.0000000000,1) t

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 7 0.01 0.00 0 0 0 0
Execute 7 0.00 0.00 0 0 0 0
Fetch 7 0.12 3.67 1971 764 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 21 0.13 3.67 1971 764 0 7

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=119 pr=332 pw=0 time=530678 us)
108 TABLE ACCESS SAMPLE MANY_X_COLS (cr=119 pr=332 pw=0 time=646431 us)

********************************************************************************

select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
*/ count(*)
from
"CBO_TEST"."MANY_X_COLS" t

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 7 0.00 0.00 0 0 0 0
Execute 7 0.00 0.00 0 0 0 0
Fetch 7 3.41 15.85 65858 70070 0 7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 21 3.41 15.86 65858 70070 0 7

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=10010 pr=9445 pw=0 time=2228265 us)
10000 TABLE ACCESS FULL MANY_X_COLS (cr=10010 pr=9445 pw=0 time=816529 us)

This means that in total between six (no column statistics) and at least 14 queries (basic column statistics for all columns) are used to gather the statistics for a 1,000 columns table in 10.2.0.4 with DBMS_STATS.AUTO_SAMPLE_SIZE.

Using a fixed sample size (or compute) it takes between three and seven queries to gather the statistics for a 1,000 columns table in 10.2.0.4. When computing the basic statistics for all columns this means a minimum of 220,000 consistent gets to gather the statistics on a 10,000 blocks table with 1,000 columns in comparison to approximately 20,000 consistent gets to perform the same on a 255 columns table - not taking into account the CPU overhead of performing the aggregates (count distinct, min, max, etc.) on all those columns.
Note that these consistent gets exclude the work required to maintain the dictionary and statistics history information.

The additional queries executed when using DBMS_STATS.AUTO_SAMPLE_SIZE in 10.2.0.4 increase the number of consistent gets to 300,000 for the 1,000 columns / 10,000 blocks table (in this case the automatism ends up in a compute) and 40,000 consistent gets for the 255 columns / 10,000 blocks table.

Of course the intention of AUTO_SAMPLE_SIZE is to find an optimal sample size and not to actually increase the amount of work compared to a compute, but in this particular case the work actually increased.

This significant difference in consistent gets between the 1,000 columns and 255 columns table both having 10,000 blocks has three reasons:

1. It requires more consistent gets per pass due to the multiple row pieces (up to four times more consistent gets)
2. DBMS_STATS performs many more passes on the 1,000 columns table (when gathering basic column statistics for all columns at least seven on the 1,000 columns table compared to at least two on the 255 columns table)
3. Obviously DBMS_STATS doesn't make a systematic attempt to minimize the number of consistent gets per pass by trying to access only leading columns per pass (so as to require only 10,000 / 20,000 / 30,000 / 40,000 gets per pass). It looks more like the columns accessed per pass are chosen randomly, often causing the maximum number of consistent gets per pass

The same applies to versions 11.1 and 11.2 as long as fixed sample sizes are used. However, when using DBMS_STATS.AUTO_SAMPLE_SIZE the whole gather statistics logic has been revised, in particular due to the improvements regarding the approximation of the number of distinct column values and incremental partition statistics, which are described in great detail by Amit Poddar here.

Due to these improvements, when using DBMS_STATS.AUTO_SAMPLE_SIZE Oracle 11.1.0.7 (and 11.2.0.1) is capable of handling up to 1,000 columns in a single pass if no column statistics are requested, and a minimum of approximately 260 columns per pass if basic column statistics are requested. This means it takes four passes to gather basic column statistics on all columns of a 1,000 columns table. Furthermore, due to the revised logic, the additional queries used in 10.2.0.4 for DBMS_STATS.AUTO_SAMPLE_SIZE are no longer issued.

Tracing the DBMS_STATS.GATHER_TABLE_STATS call in 11.1.0.7 and 11.2.0.1 also reveals that a special operation is activated that is not represented in the SQL text. Accompanied by a comment outlining what the special NDV operation is supposed to do, DBMS_STATS seems to activate an "APPROXIMATE NDV AGGREGATE" operation that is very likely used for the aforementioned improvement to the NDV algorithm.

Such a query looks then like the following:

select /*+ full(t) no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
no_substrb_pad */to_char(count("COL174")),to_char(substrb(dump(min("COL174")
,16,0,32),1,120)),to_char(substrb(dump(max("COL174"),16,0,32),1,120)),
to_char(count("COL175")),to_char(substrb(dump(min("COL175"),16,0,32),1,120))
,to_char(substrb(dump(max("COL175"),16,0,32),1,120)),...
.
.
.
from
"CBO_TEST"."MANY_X_COLS" t /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,
NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,
NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,
NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,
.
.
.
NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.11 0.10 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 1.01 2.24 9845 30010 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 1.12 2.35 9845 30010 0 1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 93 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 SORT AGGREGATE (cr=30010 pr=9845 pw=0 time=0 us)
8755 APPROXIMATE NDV AGGREGATE (cr=30010 pr=9845 pw=0 time=2336556 us cost=2 size=121114 card=82)
10000 TABLE ACCESS FULL MANY_X_COLS (cr=30010 pr=9845 pw=0 time=300223 us cost=2 size=121114 card=82)

The following differences to 10g become obvious:

1. The SQL part has been reduced to

- Number of non-null values
- Minimum value
- Maximum value

for basic column statistics. The number of distinct values is obviously determined as part of the special "APPROXIMATE NDV AGGREGATE" operation, as can be seen in the "Row Source Operation" (for further details refer to Amit Poddar's document). Note that in the case of no column statistics this is reduced to a single expression per column, the "Number of non-null values".

2. The special operation "APPROXIMATE NDV AGGREGATE" appears in the execution plan

3. The trailing comment section of the statement describes what the "APPROXIMATE NDV AGGREGATE" is supposed to perform, for more details refer to the aforementioned document by Amit Poddar

4. All expressions are converted to strings using TO_CHAR

5. A new hint NO_SUBSTRB_PAD has been introduced (example taken from 11.2.0.1)

Probably the fact that the number of expressions in the SQL part has been reduced allows Oracle 11g to process more columns per pass.

This improved variant requires approximately 150,000 consistent gets (four passes) for basic column statistics on all columns of a 10,000 blocks / 1,000 columns table with DBMS_STATS.AUTO_SAMPLE_SIZE, in comparison to 10,000 consistent gets for a table with 255 columns (single pass).

So even with the improvements of the 11g approximate NDV algorithm, multiple passes are still required when requesting basic column statistics (which generally speaking is recommended) for tables with more than approximately 260 columns. If you think about the consequences for a table that not only has many columns but also many rows (which are then at least intra-row chained), the potential overhead becomes quite obvious.

So in summary the best way to deal with such tables is to avoid them in the first place - attempt to keep the number of columns below approximately 160 in Oracle 10g, respectively 260 in 11g, when using DBMS_STATS.AUTO_SAMPLE_SIZE to ensure a single-pass operation covering all columns.

If you really need to use that many columns it might be worth the effort to write a custom gather statistics procedure that explicitly states the columns to gather statistics for in the METHOD_OPT parameter in order to minimize the number of passes required - of course keeping in mind the potential side effects on the cost based optimizer when column statistics are missing and become relevant for optimization. In 11g with DBMS_STATS.AUTO_SAMPLE_SIZE this potentially allows the number of passes to be reduced to a single one.
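As a sketch of what such a tailored call could look like - assuming, for the sake of the example, that only the four populated columns of the MANY_X_COLS test table actually need column statistics:

begin
dbms_stats.gather_table_stats(
ownname => null
, tabname => 'MANY_X_COLS'
, estimate_percent => dbms_stats.auto_sample_size
, method_opt => 'FOR COLUMNS SIZE 1 COL001, COL256, COL512, COL768'
);
end;
/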

And as a side note - using the deprecated ANALYZE ... STATISTICS command requires exactly a single pass to gather basic column statistics for all 1,000 columns of the 10,000 rows table - it even records only 10,000 consistent gets (and no "table fetch continued row"), which is quite interesting since it is unclear how the different row pieces are accessed then. It might however simply be an instrumentation issue. Of course in most cases it doesn't make sense to revert to ANALYZE ... STATISTICS due to its shortcomings (just to name a few: no parallelism, no statistics history, no global statistics on partitioned tables and no approximate NDV improvements in 11g) and its differences to DBMS_STATS that might lead to plan changes - see also the other notes linked to in that note by Jonathan Lewis.
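For reference, the deprecated variant used for such a comparison would be something along these lines (a sketch only - see the shortcomings above):

analyze table many_x_cols compute statistics for table for all columns size 1;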

Advanced Oracle Troubleshooting Session - PGA/UGA memory fragmentation

A troubleshooting session at one of my clients started with the following description of the problem:

In some specific database environments sometimes (but not always) a particular batch process takes significantly longer than expected - which in that case meant hours instead of a few minutes.

It could also be observed that particular SQL statements showed up as most prominent in that case when monitoring the process.

So the first thing was to be able to reproduce the issue at will, which had not been possible so far.

It was quite quickly possible to narrow the problem down to a single call to a complex PL/SQL stored procedure (that was part of a package and called a lot of other stored procedures / packages) that dealt with handling of LOBs representing XML data.

It turned out that by simply calling this stored procedure multiple times in a loop the issue could be reproduced at will and also in different database environments as initially reported.

The rather unique aspect of that stored procedure was that it got executed in an "exclusive" mode, which means that it was run on a quite large IBM pSeries server (AIX 5.3, 11.1.0.7, 240 GB RAM, numerous P6 multi-cores, expensive storage) with no other business processes active at the same time. It was mostly a serial execution, and when it was monitored it spent most of its time in two SQL statements that were very simple and straightforward. They were actually static SQLs embedded in PL/SQL that queried a single, very small table using a unique index access plus a table access by ROWID. So we had two SQL statements that under normal circumstances could generate at most two logical I/Os per execution, since the index/table queried was so small and therefore was completely held in the buffer cache.

When monitoring the executions of these SQLs while running the above mentioned loop it became obvious that due to the design of the logic these SQL statements were called numerous times, actually thousands of times (depending on the content of the XML LOB), per procedure call.

It also became obvious that, because it was static SQL, both the PL/SQL cursor cache and the database buffer cache did a very good job - the SQLs were only parsed once and from then on only executed, and they generated only logical I/O and no physical I/O.

Furthermore, due to the logic implemented, most of the time these SQLs didn't find the value looked up in the small table, which meant that on average each execution required only a single logical I/O - the index unique scan - with no further activity since the value could not be found in the index.

When checking the SQL execution statistics in different environments (Linux instead of AIX) it also became obvious that the same SQL statements were executed a similar number of times, but these environments didn't experience the same long runtime of the batch job, so the sheer number of SQL executions looked odd but didn't seem to be the root cause of the problem.

So we ended up with the following initial summary:

- We identified a stored procedure that deals with LOBs representing XML as culprit
- Due to the logic most of the time was spent in two static SQLs that got executed thousands of times per execution of the stored procedure
- However these SQLs used the "optimal" execution plan and actually generated only a single logical I/O per execution
- Furthermore due to the "exclusive" manner of the batch job no contention with other processes was possible
- So there are no usual suspects like "inefficient execution plans", "contention for resources", "serialization" etc.
- Therefore the usual performance monitoring methods like wait interface, extended SQL trace, active session history, session statistics / system statistics, ASH / AWR / ADDM reports didn't reveal any obvious culprit
- Executing the identified SQL statements in a separate process showed that they took on average 10 microseconds (!) per execution - which is quite fast and raised the question of how these statements could possibly show up as the problematic ones - executed in a simple loop, tens of thousands of executions were possible within a single second.

So we were at a point where we basically had the classic "The Wait Interface Is Useless (Sometimes)" situation. All we could see was that most of the time was spent executing these SQLs, and that this time was spent on CPU by the single process executing them.

But wait, there was an interesting point showing up when monitoring the session statistics delta (e.g. using Tanel Poder's "snapper" utility): The process required more and more PGA/UGA memory while executing.
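A very simple way to observe this without any special tooling is to query the memory-related session statistics between iterations - a sketch for the current session (for a different session use V$SESSTAT plus the appropriate SID instead of V$MYSTAT):

select
sn.name
, ms.value
from
v$mystat ms
, v$statname sn
where
ms.statistic# = sn.statistic#
and sn.name in ('session pga memory', 'session pga memory max', 'session uga memory', 'session uga memory max');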

So it was time for advanced troubleshooting techniques - and for instance OakTable fellow Tanel Poder has done a lot of research in this area and provides numerous helpful tools.

Further checks revealed a few anomalies that allowed us to get an idea in which direction to investigate further:

- The PGA/UGA consumption of the process was increasing with every iteration of the loop executing the PL/SQL stored procedure. So there was a memory problem somewhere hidden. A good idea therefore seemed to be to take a PGA/UGA heapdump of the process to see what kind of memory chunks could be identified.

Caution: Taking heapdumps may crash your system or process, therefore be very cautious when doing so in a production-like environment. Taking PGA heapdumps usually only affects a single process and is therefore not that critical (unless it was a background process for instance...).

This is how you can take a heapdump using EVENTS:

ALTER SESSION SET EVENTS 'immediate trace name heapdump level <level>';

This is how you can request a heapdump using ORADEBUG:
AS SYSDBA after identifying the process using SETOSPID, SETORAPID OR SETMYPID:

ORADEBUG DUMP HEAPDUMP <level>

See also Julian Dyke's website for an overview of the available heapdump levels.

In this case a level 4 + 1 = level 5 heapdump (PGA + UGA summary) seemed to be sufficient (although the current/user call heaps might also be of interest, which would be level 8 + 16 + 4 + 1 = 29).
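So for this case the actual commands would look like the following (level 5 = PGA + UGA summary; use 29 to include the current/user call heaps):

ALTER SESSION SET EVENTS 'immediate trace name heapdump level 5';

-- or, attached to the target process as SYSDBA via ORADEBUG:
ORADEBUG DUMP HEAPDUMP 5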

In our chapter 8 of the OakTable book "Expert Oracle Practices" Charles and I also describe the most important heapdump variants and the other performance monitoring methods applied here.

The trace file written can then be analyzed, for example using Tanel's heapdump_analyzer shell script, to get an initial overview. In this case, since it was quite a huge trace file, I simply used a slightly modified version of the awk program from heapdump_analyzer to generate an input file for the Oracle external table feature, so that more sophisticated reports could be run on the file.

This is the Unix command that can be used to transform the raw trace into a suitable input for an Oracle external table:

cat $TRACE_FILE | awk '
/^HEAP DUMP heap name=/ { split($0,ht,"\""); HTYPE=ht[2]; doPrintOut = 1; }
/Chunk/{ if ( doPrintOut == 1 ) {
split($0,sf,"\"");
printf "%10d , %16s, %16s, %16s\n", $4, HTYPE, $5, sf[2];
}
}
/Total heap size/ {
doPrintOut=0;
}
' > $EXT_TAB_DIR/heapdump.txt

This is how the external table definition could look:

create table heapdump
(
chunk_size integer
, heap_type varchar2(16)
, chunk_type varchar2(16)
, alloc_reason varchar2(16)
)
organization external
(
type oracle_loader
default directory ext_tab_dir
access parameters
(
records delimited by newline
fields terminated by ','
lrtrim
)
location ('heapdump.txt')
);
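With the external table in place, a simple overview report of where the memory goes could then look like this:

select
heap_type
, alloc_reason
, count(*) as num_chunks
, sum(chunk_size) as total_bytes
from
heapdump
group by
heap_type
, alloc_reason
order by
total_bytes desc;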

- The process "leaked" temporary lobs - in V$TEMPORARY_LOBS an ever increasing number of LOBs could be observed; the increase was slow however (one per execution of the stored procedure) and the space consumption in the temporary tablespace was low (a query sketch for this check follows after this list)

- When calculating the average execution time of the two SQL statements that got executed so many times it became obvious that they actually started out as quickly as observed in the separate execution but then continuously slowed down - they became slower and slower over time

- After a couple of minutes these statements that initially took a few microseconds already took milliseconds (!) to execute - no wonder the batch job took hours, since it still had to execute these statements numerous times, but now each execution took more than a thousand times longer than in the beginning

- Looking closer it became obvious that every SQL statement that got executed was affected by the slowdown, but since many of these statements took several milliseconds anyway and were not executed that many times, an "overhead" of a few milliseconds didn't really make a significant difference. It was the statements executed so frequently that incurred the biggest penalty

- Eventually, when the loop executing the PL/SQL stored procedure completed, it became obvious that basically everything done within that session was affected: for example a simple SELECT * FROM DUAL or a null PL/SQL block like BEGIN NULL; END; took something like 30 centiseconds (0.3 seconds!) to complete - doing the same in a fresh session (while the other session was still open) completed instantly (less than the 0.01 seconds measured by the SQL*Plus timing facility) - so it was something specific to the session and not an instance-wide effect.
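This is the query sketch mentioned above for checking the leaked temporary LOBs of the current session - the counters kept increasing with every execution of the stored procedure:

select
*
from
v$temporary_lobs
where
sid = (select sid from v$mystat where rownum = 1);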

So there was something that obviously took more and more CPU time that added an overhead to basically "every" operation, no matter what was performed.

Time to go one level deeper and take traces on the O/S level to find out where a process was spending most of its time. Unfortunately this was AIX 5.3, so the range of available tools to perform such a measurement was quite limited. There is no DTrace on AIX, and the new probevue tool is only available from AIX 6 on. So we were left with Oracle's built-in oradebug short_stack and AIX's procstack tool to take samples of the call stack of the foreground process.

But Tanel comes to the rescue again - using his "OStackProf" tool set we were able to take samples of the call stack, only to find out that the process seemed to spend most of its time in memory management calls - though that was only an assumption, since no-one could tell us what these sampled function names exactly meant. According to MOS document 175982.1 they were related to memory management of ADTs (Oracle objects).

This is what such a stack trace sample looked like:

Below is the stack prefix common to all samples:
------------------------------------------------------------------------
Frame->function()
------------------------------------------------------------------------
# 36 ->main()
# 35 ->opimai_real()b0
# 34 ->sou2o()
# 33 ->opidrv()
# 32 ->opiodr()b98
# 31 ->opiino()f0
# 30 ->opitsk()
# 29 ->ttcpip()c
# 28 ->opiodr()b98
# 27 ->kpoal8()c
# 26 ->opiexe()
# 25 ->kkxexe()c
# 24 ->peicnt()
# 23 ->plsql_run()
# 22 ->pfrrun()
# 21 ->pfrrun_no_tool()c
# 20 ->pfrinstr_EXECC()c
# 19 ->pevm_EXECC()e4
# 18 ->psdnal()c
# 17 ->psddr0()
# 16 ->rpidrv()
# ...(see call profile below)
#
# -#--------------------------------------------------------------------
# - Num.Samples -> in call stack()
# ----------------------------------------------------------------------
76 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()e4->auddft()b8->audbeg()->ktcxbr0()a0->kocbeg()->koctxbg()->kohalc()->kohalmc()->kghfru()c->44c0->->
20 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()b8c->kksfbc()ec->ktcsna()c->ktucloGetGlobalMinScn()->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()->ksuvrl()c->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()->ksuprc()c->44c0->->

The "auddft" / "audbeg" function names might indicate something related to auditing, but the database didn't have any auditing enabled apart from the "audit_trail = DB" setting. Alex Fatkulin recently reported an issue with auditing enabled slowing down 11.2.0.1, and indeed when setting "audit_trail = NONE" in 11.1.0.7 or 11.2.0.1 the slow-down experienced was less pronounced, which indicates that some code paths related to auditing are actually involved. However, it didn't solve the issue - a significant slow-down was still observed.

The crucial hint came from the analysis of the PGA/UGA heapdump: it showed hundreds of thousands of very small chunks that seemed to relate to XML handling ("qmxdpls_subhea") - hence the obvious assumption was that something in the XML object memory management was probably going wrong.

Therefore we focused on the code parts that performed the XML processing - and very quickly a crucial bug was identified: The DBMS_XMLDOM.FREEDOCUMENT call was missing from the code.

Apparently this bug had been in the code right from the beginning, but the application was initially developed on 10.2, which interestingly does not show the same slow-down. It shows the same symptoms, like increasing PGA/UGA memory consumption and leaked temporary lobs, but it doesn't slow down. Now that the database had been upgraded to 11.1.0.7 some time ago, the slow-down problems obviously started to surface.

So by simply removing or adding the FREEDOCUMENT call in the code the issue could be reproduced or fixed at will.

This enabled us to come up with a generic test case that reproduces the issue at will in most environments running 11.1.0.7 or later.

Note that the effect seems to depend on the Oracle port - 11.1.0.7 on Linux x64 showed slightly different behaviour: only some SQL statements were affected by the slow-down, not every statement as seemed to be the case with the AIX port.

drop table t_testloop;

purge table t_testloop;

create table t_testloop (
id integer not null
, vc varchar2(255)
, constraint t_testloop_pk primary key (id)
)
;

insert into
t_testloop
(
id
, vc
)
select
level as id
, rpad('x', 100, 'x') as vc
from
dual
connect by
level <= 100;

commit;

exec dbms_stats.gather_table_stats(null, 'T_TESTLOOP')

-- This is supposed to be an INDEX UNIQUE SCAN + TABLE ACCESS BY ROWID
explain plan for
select
id
, vc
from
t_testloop
where
id = to_number(:x);

set linesize 160
set pagesize 999

select * from table(dbms_xplan.display);

set timing on echo on serveroutput on

-- This is the normal (reference) execution time for running
-- the simple statement a thousand times
declare
procedure check_key
as
x integer;
n_id integer;
s_vc varchar2(255);
begin
x := 42 * 3;
select
id
, vc
into
n_id
, s_vc
from
t_testloop
where
id = x;
exception
when NO_DATA_FOUND then
null;
end;
begin
for i in 1..1000 loop
check_key;
end loop;
end;
/

-- "Deterministic" randomness :-))
exec dbms_random.seed(0)

declare
start_time number;
end_time number;

-- Generate some CLOB containing XML
-- Note that it looks like the CLOB needs
-- to be different for every iteration
-- otherwise the issue couldn't be reproduced
function return_clob return clob
as
the_lob clob;
s_name varchar2(20);
n_sal integer;
s_job varchar2(20);
begin
-- The element names used here are just placeholders - any well-formed
-- XML with a root element, one child element per iteration and a few
-- attributes / child values works for the purpose of this test case
the_lob := '<EMPLOYEES>';
for i in 1..20 loop
s_name := dbms_random.string('U', trunc(dbms_random.value(1, 21)));
n_sal := trunc(dbms_random.value(1, 1001));
s_job := dbms_random.string('U', trunc(dbms_random.value(1, 21)));
the_lob := the_lob ||
'<EMP ID="' || to_char(i, 'TM') || '">' ||
'<NAME>' || s_name || '</NAME>' ||
'<SAL>' || to_char(n_sal, 'TM') || '</SAL>' ||
'<JOB>' || s_job || '</JOB>' ||
'</EMP>';
end loop;
the_lob := the_lob || '</EMPLOYEES>';

return the_lob;
end return_clob;

-- Some usage of the PL/SQL XML DOM API
-- Some dummy processing of the attributes of the given node
procedure process_attributes
(
in_node dbms_xmldom.DOMNode
)
is
len number;
n dbms_xmldom.DOMNode;
nnm dbms_xmldom.DOMNamedNodeMap;
key varchar2(1000);
val varchar2(32767);
BEGIN
nnm := dbms_xmldom.getAttributes(in_node);

if (dbms_xmldom.isNull(nnm) = FALSE) then
len := dbms_xmldom.getLength(nnm);

-- loop through attributes
for i in 0..len-1 loop
n := dbms_xmldom.item(nnm, i);
key := dbms_xmldom.getNodeName(n);
val := dbms_xmldom.getNodeValue(n);
end loop;
end if;

end process_attributes;

-- Some usage of the PL/SQL XML DOM API
-- Recursively walk the nodes of the DOM
-- and call the attribute processing per node
procedure walk_node
(
in_node dbms_xmldom.DOMNode
)
is
nl dbms_xmldom.DOMNodeList;
len number;
n dbms_xmldom.DOMNode;
node_name varchar2(100);
begin
-- loop through elements
node_name:=dbms_xmldom.getNodeName(in_node);

process_attributes(in_node);

nl := dbms_xmldom.getChildNodes(in_node);
len := dbms_xmldom.getLength(nl);
for i in 0..len-1 loop
n := dbms_xmldom.item(nl, i);
node_name := dbms_xmldom.getNodeName(n);
walk_node(n);
end loop;
end walk_node;

  -- The main procedure
  procedure process_xml_clob
  as
    the_lob  clob;
    var      XMLType;
    doc      dbms_xmldom.DOMDocument;
    root     dbms_xmldom.DOMNode;
    root_tag varchar2(100);
  begin
    -- Get the CLOB with the XML
    the_lob := return_clob;

    -- Instantiate an XMLType
    var := xmltype(the_lob);

    -- Generate a new DOM document from the XMLType
    -- This seems to allocate a temporary LOB under the covers
    doc := dbms_xmldom.newDOMDocument(var);

    -- Some rudimentary XML DOM processing
    root := dbms_xmldom.makeNode(dbms_xmldom.getDocumentElement(doc));

    root_tag := dbms_xmldom.getNodeName(root);

    -- If you want to burn more CPU to exaggerate the effect
    -- uncomment this
    -- walk_node(root);

    -- This omission here causes a significant PGA/UGA memory leak
    -- and causes version 11.1 and 11.2 to slow down everything
    -- in this session
    -- Version 10.2 suffers from the same symptoms but doesn't slow down
    --DBMS_XMLDOM.freeDocument(doc);
  end;
begin
  -- Run this a thousand times and measure / output the runtime per execution
  for i in 1..1000 loop
    start_time := dbms_utility.get_time;
    process_xml_clob;
    end_time := dbms_utility.get_time;
    dbms_output.put_line('Run ' || to_char(i, 'TM') || ': Time (in seconds)= ' || ((end_time - start_time)/100));
  end loop;
end;
/

-- Do the simple statement again a thousand times
-- Notice the difference in runtime when using 11.1.0.7 or 11.2.0.1
declare
  procedure check_key
  as
    x    integer;
    n_id integer;
    s_vc varchar2(255);
  begin
    x := 42 * 3;
    select
      id
    , vc
    into
      n_id
    , s_vc
    from
      t_testloop
    where
      id = x;
  exception
    when NO_DATA_FOUND then
      null;
  end;
begin
  for i in 1..1000 loop
    check_key;
  end loop;
end;
/

The crucial line is the commented-out call to "DBMS_XMLDOM.FREEDOCUMENT" near the end of PROCESS_XML_CLOB above. If you run this script without the FREEDOCUMENT call on a version that is affected by the slow-down, like 11.1.0.7 or 11.2.0.1, then a typical output might look like the following:

SQL> declare
2 procedure check_key
3 as
4 x integer;
5 n_id integer;
6 s_vc varchar2(255);
7 begin
8 x := 42 * 3;
9 select
10 id
11 , vc
12 into
13 n_id
14 , s_vc
15 from
16 t_testloop
17 where
18 id = x;
19 exception
20 when NO_DATA_FOUND then
21 null;
22 end;
23 begin
24 for i in 1..1000 loop
25 check_key;
26 end loop;
27 end;
28 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.94

.
.
.

Run 1: Time (in seconds)= .49
Run 2: Time (in seconds)= .08
Run 3: Time (in seconds)= .08
Run 4: Time (in seconds)= .08
Run 5: Time (in seconds)= .05
Run 6: Time (in seconds)= .03
Run 7: Time (in seconds)= .03
Run 8: Time (in seconds)= .03
Run 9: Time (in seconds)= .03
Run 10: Time (in seconds)= .02
Run 11: Time (in seconds)= .03
Run 12: Time (in seconds)= .03
Run 13: Time (in seconds)= .03
Run 14: Time (in seconds)= .03
Run 15: Time (in seconds)= .03
Run 16: Time (in seconds)= .03
Run 17: Time (in seconds)= .02
Run 18: Time (in seconds)= .03
Run 19: Time (in seconds)= .03
Run 20: Time (in seconds)= .03
Run 21: Time (in seconds)= .03
Run 22: Time (in seconds)= .03
Run 23: Time (in seconds)= .03
Run 24: Time (in seconds)= .03
Run 25: Time (in seconds)= .03
.
.
.
Run 287: Time (in seconds)= .03
Run 288: Time (in seconds)= .03
Run 289: Time (in seconds)= .03
Run 290: Time (in seconds)= .03
Run 291: Time (in seconds)= .02
Run 292: Time (in seconds)= .03
Run 293: Time (in seconds)= .03
Run 294: Time (in seconds)= .03
Run 295: Time (in seconds)= .03
Run 296: Time (in seconds)= .03
Run 297: Time (in seconds)= .03
Run 298: Time (in seconds)= .02
Run 299: Time (in seconds)= .03
Run 300: Time (in seconds)= .03
Run 301: Time (in seconds)= .03
Run 302: Time (in seconds)= .03
Run 303: Time (in seconds)= .17
Run 304: Time (in seconds)= .17
Run 305: Time (in seconds)= .17
Run 306: Time (in seconds)= .17
Run 307: Time (in seconds)= .17
Run 308: Time (in seconds)= .17
Run 309: Time (in seconds)= .17
Run 310: Time (in seconds)= .17
Run 311: Time (in seconds)= .18
Run 312: Time (in seconds)= .17
Run 313: Time (in seconds)= .18
Run 314: Time (in seconds)= .18
Run 315: Time (in seconds)= .18
Run 316: Time (in seconds)= .17
Run 317: Time (in seconds)= .19
Run 318: Time (in seconds)= .18
Run 319: Time (in seconds)= .18
Run 320: Time (in seconds)= .19
Run 321: Time (in seconds)= .18
Run 322: Time (in seconds)= .19
.
.
.
Run 973: Time (in seconds)= .82
Run 974: Time (in seconds)= .83
Run 975: Time (in seconds)= .83
Run 976: Time (in seconds)= .82
Run 977: Time (in seconds)= .83
Run 978: Time (in seconds)= .83
Run 979: Time (in seconds)= .82
Run 980: Time (in seconds)= .82
Run 981: Time (in seconds)= .83
Run 982: Time (in seconds)= .82
Run 983: Time (in seconds)= .83
Run 984: Time (in seconds)= .83
Run 985: Time (in seconds)= .82
Run 986: Time (in seconds)= .84
Run 987: Time (in seconds)= .83
Run 988: Time (in seconds)= .86
Run 989: Time (in seconds)= .84
Run 990: Time (in seconds)= .83
Run 991: Time (in seconds)= .85
Run 992: Time (in seconds)= .84
Run 993: Time (in seconds)= .84
Run 994: Time (in seconds)= .85
Run 995: Time (in seconds)= .84
Run 996: Time (in seconds)= .85
Run 997: Time (in seconds)= .84
Run 998: Time (in seconds)= .87
Run 999: Time (in seconds)= .84
Run 1000: Time (in seconds)= .85

PL/SQL procedure successfully completed.

Elapsed: 00:06:00.49
SQL>
SQL> declare
2 procedure check_key
3 as
4 x integer;
5 n_id integer;
6 s_vc varchar2(255);
7 begin
8 x := 42 * 3;
9 select
10 id
11 , vc
12 into
13 n_id
14 , s_vc
15 from
16 t_testloop
17 where
18 id = x;
19 exception
20 when NO_DATA_FOUND then
21 null;
22 end;
23 begin
24 for i in 1..1000 loop
25 check_key;
26 end loop;
27 end;
28 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:03.02
SQL>

Notice how, after a certain number of iterations, the execution gets slower and slower. Even more interesting: the simple loop of 1,000 SQL statement executions afterwards takes significantly longer than before the XML code block was executed (roughly three seconds instead of one in the run shown above).

You might want to monitor the PGA/UGA consumption via the session statistics, V$PROCESS and V$TEMPORARY_LOBS while the block is executing, for example along the lines of the queries shown below.
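A rough sketch of such monitoring, run from a second session (the &test_sid substitution variable is a placeholder for the SID of the session executing the test block; these queries are not part of the original test case):

-- Session-level PGA/UGA figures of the test session
select sn.name, st.value
from   v$sesstat st, v$statname sn
where  st.statistic# = sn.statistic#
and    st.sid = &test_sid
and    sn.name in ('session pga memory', 'session pga memory max',
                   'session uga memory', 'session uga memory max');

-- Temporary LOBs currently held by the test session
select * from v$temporary_lobs where sid = &test_sid;

-- Process-level PGA figures of the test session
select p.pga_used_mem, p.pga_alloc_mem, p.pga_max_mem
from   v$process p, v$session s
where  p.addr = s.paddr
and    s.sid = &test_sid;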

If you run the same in 10.2.0.4, you'll observe the same symptoms regarding memory and temporary LOB usage, but the execution time stays consistent, at least up to 1,000 iterations. Obviously there will be a point where a system gets into trouble with the ever-increasing memory consumption, however that is not the point here.

If DBMS_XMLDOM.FREEDOCUMENT is called as intended, then the same code block in 11.1.0.7 and 11.2.0.1 also shows consistent execution times (and no increased PGA/UGA memory or temporary LOB usage).
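For reference, a minimal sketch of the corrected tail of PROCESS_XML_CLOB, based on the test case above (depending on what else is done with the XMLType further cleanup might be appropriate, but the FREEDOCUMENT call is the decisive part):

    root_tag := dbms_xmldom.getNodeName(root);

    -- Explicitly release the DOM document (and with it the temporary LOB
    -- allocated under the covers) before leaving the procedure
    dbms_xmldom.freeDocument(doc);
  end;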

Interestingly, Nigel Noble has also recently published a blog note about PL/SQL memory issues that are only reproducible in 11.1.0.7 but don't show up in 10.2.0.4.

So although the issues reported by Alex Fatkulin and Nigel Noble aren't necessarily related to the one described here, all of this suggests that Oracle release 11 introduced changes to the way memory is managed (possibly due to the introduction of Automatic Memory Management (AMM) and the memory_target parameter) that under certain circumstances behave differently than in previous releases.
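As a quick aside that is not part of the original test case, whether AMM is active on a given test instance can be checked from SQL*Plus like this:

-- Shows 0 if Automatic Memory Management is not in use
show parameter memory_target
show parameter memory_max_target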

References:

Tanel Poder's Advanced Oracle Troubleshooting series (Advanced Oracle Troubleshooting; Tanel's posts on troubleshooting)

Expert Oracle Practices from the OakTable, Chapter 8

When column name matters

There’s a thread on the OTN forum with a case of amusing behavior in the 11.2.0.1 JDBC driver: it fails to properly detect all bind variables in a SQL statement referencing a column named NORDER. WTF?! PS. 1) Oracle’s SSO is actually Multi Sign-On because I have two accounts (one for OTN forums and one for [...]