Oakies Blog Aggregator

Nested Tables

This note is a little side effect of answering a question about the total space associated with a table, including its indexes, LOB columns and nested tables. The first suggestion I published failed to take account of the fact that nested tables can contain their own nested tables so I had to modify the code.

The basic change was easy – in the original code I had joined to the view dba_nested_tables to translate between a table_name and its parent_table_name. The only change I needed to make was to replace the reference to the view with a different view that joined a table_name to its “oldest ancestor”. To achieve this I had to do two things: first, create a table with multiple levels of nesting, then create a suitable view definition.  For reference – because it may help somebody – I’ve published the two stages here.

A revolting nested table:

The following code creates three object types and three corresponding table types. The innermost table type (jpl_tab3_type) is a table of simple objects; the middle table type (jpl_tab2_type) is a table of objects that each carry a column of the innermost table type; and the outermost table type (jpl_tab1_type) is a table of objects that each carry a column of the middle one – so the demo table ends up with three levels of nested tables:


rem
rem     Script:         nested_tables.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create or replace type jpl_item3 as object (v1 varchar2(10), n3 number);
/

create or replace type jpl_tab3_type as table of jpl_item3;
/

create or replace type jpl_item2 as object(v2 varchar2(10), jpl3 jpl_tab3_type);
/

create or replace type jpl_tab2_type as table of jpl_item2;
/

create or replace type jpl_item1 as object(v3 varchar2(10), jpl2 jpl_tab2_type);
/

create or replace type jpl_tab1_type as table of jpl_item1;
/

create table demo_nest_2 (
        id      number  constraint d2_pk primary key,
        jpl1    jpl_tab1_type
)
segment creation immediate
nested table jpl1 store as t_jpl1
        (
        nested table jpl2  store as t_jpl2
                (
                nested table jpl3 store as t_jpl3 
                return as locator
                )
        return as locator
        )
return as locator
/

I’ve never seen nested tables manually created in a production system, though I believe they are likely to appear (along with varrays and LOBs) as a side effect of some XML or JSON mechanisms. Many years ago (in Practical Oracle 8i, possibly) I discussed the pros and cons of returning them “by value” or (as I have here) “by reference”. As you can see, you need to exercise some care with brackets and the layout of the text as you try to define multiple levels of nesting.

The Ancestor View

We’ll take this in three steps – first, report from user_nested_tables. (The final script for reporting space uses dba_nested_tables, which makes a predicate on the owner column available, but I don’t give myself DBA privileges while writing quick and dirty bits of SQL.)


select 
        parent_table_name, parent_table_column, table_name, 
        ltrim(storage_spec) storage_spec,       -- the DBA view definition includes lpad( , 30) !!
        ltrim(return_type ) return_type         -- the DBA view definition includes lpad( , 20) !!
from 
        user_nested_tables
order by
        parent_table_name, parent_table_column
/

PARENT_TABLE_NAME    PARENT_TABLE_COLUMN  TABLE_NAME           STORAGE_SPEC         RETURN_TYPE
-------------------- -------------------- -------------------- -------------------- --------------------
DEMO_NEST_2          JPL1                 T_JPL1               USER_SPECIFIED       LOCATOR
T_JPL1               JPL2                 T_JPL2               USER_SPECIFIED       LOCATOR
T_JPL2               JPL3                 T_JPL3               DEFAULT              LOCATOR

You’ll notice the odd ltrim() – I have no idea why the view defines these columns to be left-padded the way they are; possibly it dates back to the time when something like cast(… as varchar2(30)) wasn’t a possible option.

Next, a simple “connect by” query that uses the above list in a materialized “with subquery” (CTE):


with my_nested_tables as (
select
        /*+ materialize */
        parent_table_name, table_name
from
        user_nested_tables
)
select
        parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
order by
        parent_table, child_table, pathname
/

PARENT_TABLE         CHILD_TABLE          PATHNAME
-------------------- -------------------- ----------------------------------------
DEMO_NEST_2          T_JPL1               /T_JPL1
DEMO_NEST_2          T_JPL2               /T_JPL1/T_JPL2
DEMO_NEST_2          T_JPL3               /T_JPL1/T_JPL2/T_JPL3
T_JPL1               T_JPL2               /T_JPL2
T_JPL1               T_JPL3               /T_JPL2/T_JPL3
T_JPL2               T_JPL3               /T_JPL3


As required this shows me demo_nest_2 as the owning ancestor of t_jpl1, t_jpl2 and t_jpl3. Unfortunately it has also produced three rows that we don’t want to see in our final space-summing code. But it’s easy enough to get rid of those – the only rows we want are the rows with a parent_table that doesn’t appear as a child_table:


with my_nested_tables as (
select
        /*+ materialize */
        parent_table_name, table_name
from
        user_nested_tables
)
select  parent_table, child_table, pathname
from    (
        select
                level,
                sys_connect_by_path(table_name, '/')    pathname,
                connect_by_root parent_table_name parent_table,
                table_name child_table
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
where   (parent_table) not in (
                select table_name
                from   my_nested_tables
        )
order by
        parent_table, child_table, pathname
/

PARENT_TABLE         CHILD_TABLE          PATHNAME
-------------------- -------------------- ----------------------------------------
DEMO_NEST_2          T_JPL1               /T_JPL1
DEMO_NEST_2          T_JPL2               /T_JPL1/T_JPL2
DEMO_NEST_2          T_JPL3               /T_JPL1/T_JPL2/T_JPL3

3 rows selected.


And there’s the result we wanted to see. A first simple corroboration of the fact is that the output corresponds with the “NESTED TABLE” segments reported by user_segments.
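If you want to run that corroboration yourself, a minimal query against user_segments (a sketch, assuming you are connected as the schema that owns demo_nest_2) is all it takes – the three storage tables t_jpl1, t_jpl2 and t_jpl3 should appear:

select
        segment_name, segment_type, blocks, bytes
from
        user_segments
where
        segment_type = 'NESTED TABLE'
order by
        segment_name
/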

Of course, having written a query that gets the right result from the table definition we used to help us define the query in the first place, we now ought to create a few more tables with different structures to see if the query continues to give the right results in more complex cases.

What happens, for example, if I create a table with two columns of nested tables, and one of the nested table types itself contains two nested tables? What happens if the base table is an index organized table?

It’s easy to do the second test – just add “organization index” immediately after “segment creation immediate” in the table creation statement. The correct results drop out.
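For clarity, this is the only part of the earlier create table statement that changes – a sketch showing just the affected lines in context, not the full statement:

segment creation immediate
organization index
nested table jpl1 store as t_jpl1

Everything else in the statement stays exactly as shown above.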

As for the first test – here’s the SQL to create one example (and the query still gets the right answers – even if you change the table to be index organized).


drop type jpl_tab1_type;
drop type jpl_item1;

drop type jpl_tab2_type;
drop type jpl_item2;

drop type jpl_tab3_type;
drop type jpl_item3;

purge recyclebin;

create or replace type jpl_item3 as object (v1 varchar2(10), n3 number);
/

create or replace type jpl_tab3_type as table of jpl_item3;
/

create or replace type jpl_item2 as object(v2 varchar2(10), jpl3 jpl_tab3_type, jpl3x jpl_tab3_type);
/

create or replace type jpl_tab2_type as table of jpl_item2;
/

create or replace type jpl_item1 as object(v3 varchar2(10), jpl2 jpl_tab2_type)
/

create or replace type jpl_tab1_type as table of jpl_item1;
/

create table demo_nest_3 (
        id      number  constraint d2_pk primary key,
        jpl1    jpl_tab1_type,
        jpl1a   jpl_tab1_type
)
segment creation immediate
-- organization index
nested table jpl1 store as t_jpl1
        (
        nested table jpl2  store as t_jpl2
                (
                nested table jpl3 store as t_jpl3 
                return as locator
                nested table jpl3x store as t_jpl3x 
                return as locator
                )
        return as locator
        )
return as locator
nested table jpl1a store as t_jpl1a
        (
        nested table jpl2  store as t_jpl2a
                (
                nested table jpl3 store as t_jpl3a
                return as locator
                nested table jpl3x store as t_jpl3xa
                return as locator
                )
        return as locator
        )
return as locator
/

All that remains now is to modify the code to use the equivalent DBA views, with the addition of the owner column, then slot the resulting code into the original query in place of the simple references to dba_nested_tables. If you go to the original posting you’ll see that I’ve done this by wrapping the code into a CTE so that the only changes to the rest of the code appear as two (flagged) changes where the CTE is then used.
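I haven’t repeated the full space-summing query here, but as a sketch of the shape of that CTE – assuming you have select access to dba_nested_tables, and treating the &m_owner substitution variable and the final select as placeholders for however the owner is supplied and the code is hooked into the original query:

with my_nested_tables as (
        select
                /*+ materialize */
                owner, parent_table_name, table_name
        from
                dba_nested_tables
        where
                owner = upper('&m_owner')
),
nested_table_ancestors as (
        select
                parent_table, child_table
        from    (
                select
                        connect_by_root parent_table_name parent_table,
                        table_name                        child_table
                from
                        my_nested_tables
                connect by
                        parent_table_name = prior table_name
                )
        where   parent_table not in (
                        select table_name from my_nested_tables
                )
)
select * from nested_table_ancestors
/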

PostgreSQL Invalid Page and Checksum Verification Failed

At the Seattle PostgreSQL User Group meetup this past Tuesday, we got onto the topic of invalid pages in PostgreSQL. It was a fun discussion and it made me realize that it’d be worth writing down a bunch of the stuff we talked about – it might be interesting to a few more people too!

Invalid Page In Block

You see an error message that looks like this:

ERROR: invalid page in block 1226710 of relation base/16750/27244

First and foremost – what does this error mean? I like to think of PostgreSQL as having a fairly strong “boundary” between (1) the database itself and (2) the operating system [and by extension everything else… firmware, disks, network, remote storage, etc]. PostgreSQL maintains page validity primarily on the way in and out of its buffer cache.

[Image: PostgreSQL page validity boundary – https://ardentperf.files.wordpress.com/2019/11/pg-page-validity.png]

What does this mean in practice? Suppose there’s a physical memory failure and somehow the ECC parity is unable to detect it. This means that a little bit of physical memory on the server now has incorrect garbage and the correct data from that memory is lost.

  • If the garbage bytes map to part of the kernel page cache, then when PostgreSQL tries to copy the page into its buffer cache it will (if possible) detect that something is wrong, refuse to poison its buffer cache with this invalid 8k page, and error out any queries that require this page for processing with the ERROR message above.
  • If the garbage bytes map to part of PostgreSQL’s database buffer cache, then PostgreSQL will quite happily assume nothing is wrong and attempt to process the data on the page. Results are unpredictable; probably all sorts of ERROR messages, crashes and failure modes could result – or maybe even incorrect data returned with no ERROR message at all. (Note that this is probably the case for nearly all software… and also note that ECC is pretty good.)

How PostgreSQL Checks Page Validity

PostgreSQL has two main “validity checks” that it performs on pages. You can read the code in the function PageIsVerified() but I’ll summarize here. You can tell from your error message which validity check failed. It depends on whether you see a second additional WARNING right before the ERROR. The warning would look like this:

WARNING: page verification failed, calculated checksum 3482 but expected 32232
  1. If the above warning is not present, this means the page header failed a basic sanity check. This could conceivably be caused by both problems inside and outside of PostgreSQL.
  2. If you see the above warning (page verification failed), this means the checksum recorded in the block did not match the checksum calculated for the block. This most likely indicates that there was a problem outside of (underneath) the database – operating system, memory, networking, storage, etc.

About Basic Page Header Sanity Checks

As of when I’m writing this article in 2019, the following basic sanity checks are performed on the page header (a quick pageinspect query for eyeballing these fields follows the list):

  • There are 32 bits reserved for page flag bits; at present only three are used and the other 29 bits should always be zero/off.
  • Every page is divided into four parts (header, free space, tuples, special space).  Offsets for the divisions are stored as 16-bit numbers in the page header; the offsets should go in order and should not have a value pointing off the page.
  • The offset of the special space should always be aligned.
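If you want to look at those header fields on a page that the server can still read, the pageinspect contrib extension exposes them. A minimal sketch – the table name is hypothetical, and you need superuser (or equivalent grants) to use these functions:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- header fields of block 0 of a hypothetical table
SELECT lsn, checksum, flags, lower, upper, special, pagesize, version
FROM page_header(get_raw_page('my_table', 0));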

About PostgreSQL Checksums

PostgreSQL version 9.3 (released in 2013) added the ability to calculate a checksum on data pages and store the checksum in the page. There are two inputs to the checksum: (1) every single byte of the data page, with zeros in the four bytes where the checksum will be stored later and (2) the page offset/address. This means that PostgreSQL doesn’t just detect if a byte is changed in the page – it also detects if a perfectly valid page gets somehow dropped into the wrong place.

Checksums are not maintained for blocks while they are in the shared buffers – so if you look at a buffer in the PostgreSQL page cache with pageinspect and you see a checksum value, note that it’s probably just leftover from the last read but wouldn’t have been maintained when the page was changed. The checksum is calculated and stamped onto the page when the page is written out of the buffer cache into the operating system page cache (remember the pages get flushed to disk later by a separate fsync call).

The checksum algorithm is specifically designed to take advantage of SIMD instructions. The slowest part of the algorithm is multiplication, so if possible PostgreSQL will be compiled to perform 32 multiplications at a time in parallel. In 2013 there were two platforms specifically documented to support this: x86 SSE4.1 and ARM NEON. The implementation is specifically tuned for optimal performance on x86 SSE. This is an important point actually – particularly for folks running PostgreSQL on embedded devices – PostgreSQL certainly compiles and works well on a lot of platforms, but evidently the checksum implementation is explicitly optimized to run the fastest on Intel. (To be clear… I think everyone should use checksums on every platform unless you have a really really good reason not to – just don’t be surprised if you start comparing benchmarks with Intel and you see a difference!)

For folks interested in digging a little more into the history… here’s the original commit using a CRC algorithm which never shipped in an actual PostgreSQL release (Simon Riggs, Jeff Davis and Greg Smith), here’s the subsequent commit introducing the FNV-1a algorithm instead of CRC which is what we still use today (Ants Aasma), and here’s the commit doing the major refactor which moved the algorithm into a header file for easier external use (Tom Lane).

More Ways To Check Validity

At the SEAPUG meetup this led to a discussion about checking validity. Do checksums cover everything? (hint: no) Are there more ways we can validate our databases? (hint: yes)

I haven’t yet made a full list of which files are covered by checksums and which ones aren’t, but I know that not everything is. For example: I’m pretty sure that the visibility map and SLRU files aren’t covered with checksums. But for what it’s worth, there are two extra tools we can use for verification in PostgreSQL databases:

  • The amcheck extension can scan a B-Tree index for a number of logical problems – for example, verifying that all B-Tree pages have items in “logical” order. (This could be useful, for example, if you’re not specifying ICU collation and you recently upgraded your operating system collation libraries… since PostgreSQL uses OS collation by default.)
  • The pg_visibility extension includes two functions to check for corruption in the visibility map – pg_check_frozen() and pg_check_visible(). (A quick sketch of running both checks follows this list.)
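Here’s a minimal sketch of running both checks – the extension names are real, but the index and table names are hypothetical:

CREATE EXTENSION IF NOT EXISTS amcheck;
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- logical checks on a btree index (hypothetical index name)
SELECT bt_index_check('my_table_pkey');

-- visibility map cross-checks (hypothetical table name); both return any suspect ctids
SELECT * FROM pg_check_frozen('my_table');
SELECT * FROM pg_check_visible('my_table');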

Responding To Checksum Failures

Finally, what if you actually run into a checksum failure? What should you do, and are there any additional tools you should know about?

First of all – on any database – there are a few things you should always do immediately when you see something indicating that a data corruption may have occurred:

  • Verify that your backup retention and your log file retention are sufficiently long – I recommend at least a month (this is a Happiness Hint). You never know how long the investigation might take, or how long ago something important might have happened.
  • Take a moment to articulate and write out the impact to the business. (Are important queries failing right now? Is this causing an application outage?) This seems small but it can be very useful in making decisions later. Don’t exaggerate the impact but don’t understate it either. It can also be helpful to note important timelines that you already know about. For example: management is willing to use yesterday’s backup and take a day of data loss to avoid a 12 hour outage, but not to avoid a 4 hour outage …or… management needs a status update at 11:00am Eastern Time.
  • If there’s a larger team working on the system in question, communicate a freeze on changes until things are resolved.
  • Make a list or inventory of all copies of the data. Backups, physical replicas or hot standbys, logical replicas, copies, etc. Sometimes the very process of making this list can immediately give you ideas for simple solutions (like checking if a hot standby has the block intact). The next thing you probably want to do is check all items in this list to see if they have a valid copy of the data in question. Do not take any actions to remediate the problem right away; collect all of the information first. The data you collect now might be useful during RCA investigation after you’re back online.
  • If there was one checksum failure, then you probably want to make sure there aren’t more.
    • If it’s a small database, consider whether you can scan the whole thing and verify the checksum on every single block.
    • If it’s a large database, consider whether you can at least scan all of the system/catalog tables and perhaps scan the tables which are throwing errors in their entirety. (PostgreSQL stops on the first error, so there isn’t an easy way to check if other blocks in the same table also have checksum problems.)
  • A few general best practices… (1) have a second person glance at your screen before you execute any actual changes, (2) never delete anything but always rename/move instead, (3) when investigating individual blocks, also look at the block before and the block after to verify they look like what you’d normally expect, (4) test the remediation plan before running it in production, and (5) document everything. If you’ve never seen Paul Vallée’s FIT-ACER checklist then it’s worth reviewing.

There’s no single prescriptive process for diagnosing the scope of the problem and finding the right path forward for you. It involves learning what you need to know about PostgreSQL, a little creative thinking about possible resolutions, and balancing the needs of the business (for example, how long can you be down and how much data loss can you tolerate).

That being said, there are a few tools I know about which can be very useful in the process. (And there are probably more that I don’t know about; please let me know and I can add them to this list!)


Data investigation:

Unix/Linux Commands
You might be surprised at what you can do with the tools already installed on your operating system. I’ve never been on a unix system that didn’t have dd and od installed, and I find that many Linux systems have hexdump and md5sum installed as well. A few examples of how useful these tools are: dd can extract the individual block with invalid data on the primary server and extract the same block on the hot standby, then od/hexdump can be used to create a human-readable textual dump of the binary data. You can even use diff to find the differences between the blocks. If you have a standby cluster with storage-level replication then you could use md5sum to see at a glance if the blocks match. (Quick word of caution on comparing hot standbys: last I checked, PostgreSQL doesn’t seem to maintain the free space identically on hot standbys, so the checksums might differ on perfectly healthy blocks. You can still look at the diff and verify whether free space is the only difference.) Drawbacks: low-level utilities can only do binary dumps but cannot interpret the data. Also, utilities like dd are “sharp knives” – powerful tools which can cause damage if misused!

For a great example of using dd and od, read the code in Bertrand Drouvot’s pg_toolkit script collection.
Data investigation and checksum verification:

pg_filedump
This is a crazy awesome utility and I have no idea why it’s not in core PostgreSQL. It makes an easy-to-read textual dump of the binary contents of PostgreSQL data blocks. You can process a whole file or just specify a range of blocks to dump. It can verify checksums and it can even decode the contents of the tuples. As far as I can tell, it was originally written by Patrick Macdonald at Red Hat some time in the 2000’s and then turned over to the PostgreSQL Global Development Group for stewardship around 2011. It’s now in a dedicated repository at git.postgresql.org and it seems that Tom Lane, Christoph Berg and Teodor Sigaev keep it alive but don’t invest heavily in it. Drawbacks: be aware that it doesn’t address the race condition with a running server (see Credativ pg_checksums below). For dumping only a block with a checksum problem, this is not an issue since the server won’t let the block into its buffer cache anyway.
Checksum verification:

PostgreSQL pg_checksums
PostgreSQL itself starting in version 11 has a command-line utility to scan one relation or everything and verify the checksum on every single block. It’s called pg_verify_checksums in v11 and pg_checksums in v12. Drawbacks: first, this utility requires you to shut down the database before it will run. It will throw an error and refuse to run if the database is up. Second, you can scan a single relation but you can’t say which database it’s in… so if the OID exists in multiple databases, there’s no way to just scan the one you care about.
Checksum verification:

Credativ pg_checksums
The fine engineers of Credativ have published an enhanced version of pg_checksums which can verify checksums on a running database. It looks to me like the main case they needed to protect against was the race condition between pg_checksums reading blocks while the running PostgreSQL server was writing those same blocks. Linux of course works on a 4k page size; so if an 8k database page is half written when pg_checksums reads it then we will get a false positive. The version from Credativ however is smart enough to deal with this. Drawbacks: check the GitHub issues; there are a couple of notable drawbacks but this project was only announced last week and all the drawbacks might be addressed by the time you read this article. Also, being based on the utility in PostgreSQL, the same limitation about scanning a single relation applies.

Note that both Credativ’s and PostgreSQL’s pg_checksums utilities access the control file, even when just verifying checksums. As a result, you need to make sure you compile against the same version of PostgreSQL code as the target database you’re scanning.
Checksum verification:

Satoshi Nagayasu postgres-toolkit
I’m not sure if this is still being maintained, but Satoshi Nagayasu wrote postgres-toolkit quite a long time ago which includes a checksum verification utility. It’s the oldest one I have seen so far – and it still compiles and works! (Though if you want to compile it on PostgreSQL 11 or newer then you need to use the patch in this pull request.) Satoshi’s utility also has the very useful capability of scanning an arbitrary file that you pass in – like pg_filedump but stripped down to just do the checksum verification. It’s clever enough to infer the segment number from the filename and scan the file, even if the file isn’t part of a PostgreSQL installation. This would be useful, for example, if you were on a backup server and wanted to extract a single file from your backup and check if the damaged block has valid checksum in the backup. Drawbacks: be aware that it doesn’t address the race condition with a running server.
Checksum verification:

Google pg_page_verification
Simple program; you pass in a data directory and it will scan every file in the data directory to verify the checksums on all blocks. Published to Github in early 2018. Drawbacks: be aware that it doesn’t address the race condition with a running server. Probably superseded by the built-in PostgreSQL utilities.
Mitigation:

PostgreSQL Developer Options
PostgreSQL has hundreds of “parameters” – knobs and buttons you can use to configure how it runs. There are 294 entries in the pg_settings table on version 11. Buried in these parameters are a handful of “Developer Options” providing powerful (and dangerous) tools for mitigating data problems – such as ignore_checksum_failure, zero_damaged_pages and ignore_system_indexes. Read very carefully and exercise great care with these options – when not fully understood, they can have unexpected side effects including unintended data loss. Exercise particular care with the ignore_checksum_failure option – even if you set that in an individual session, the page will be readable to all connections… think of it as poisoning the buffer cache. That being said, sometimes an option like zero_damaged_pages is the fastest way to get back up and running. (Just make sure you’ve saved a copy of that block!) By the way… a trick to trigger a read of one specific block is to SELECT * FROM table WHERE ctid='(blockno,1)'.
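As a concrete (hedged) illustration of those last two points – the table name and block number below are hypothetical:

-- list the three developer options mentioned above with their current values
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN ('ignore_checksum_failure', 'zero_damaged_pages', 'ignore_system_indexes');

-- the ctid trick: force a read of one specific block (block 250 of a hypothetical table)
SELECT * FROM my_table WHERE ctid = '(250,1)';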
Mitigation:

Unix/Linux Commands
I would discourage the use of dd to mitigate invalid data problems. It’s dangerous even for experienced engineers; simple mistakes can compound the problem. I can’t imagine a situation where this is a better approach than the zero_damaged_pages developer option and a query to read a specific block. That said, I have seen cases where dd was used to zero out a page.

More About Data Investigation

In order to put some of this together, I’ll just do a quick example session. I’m running PostgreSQL 11.5 on an EC2 instance and I used dd to write a few evil bytes into a couple blocks of my database.

First, let’s start by just capturing the information from the log files:

$ grep "invalid page" ../log/postgresql.log|sed 's/UTC.*ERROR//'
 2019-10-15 19:53:37 :  invalid page in block 0 of relation base/16385/16493
 2019-10-16 22:26:30 :  invalid page in block 394216 of relation base/16385/16502
 2019-10-16 22:43:24 :  invalid page in block 394216 of relation base/16385/16502
 2019-11-05 23:55:33 :  invalid page in block 394216 of relation base/16385/16502
 2019-11-05 23:57:58 :  invalid page in block 394216 of relation base/16385/16502
 2019-11-05 23:59:14 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-05 23:59:21 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-05 23:59:22 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-05 23:59:23 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-06 00:01:12 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-06 00:01:16 :  invalid page in block 0 of relation base/16385/16493
 2019-11-06 00:02:05 :  invalid page in block 250 of relation base/16385/16492

With a little command-line karate we can list each distinct block and see the first time we got an error on that block:

$ grep "invalid page" ../log/postgresql.log |
sed 's/UTC.*ERROR//' |
awk '{print $1" "$2" "$11" invalid_8k_block "$8" segment "int($8/131072)" offset "($8%131072)}' |
sort -k3,5 -k1,2 |
uniq -f2
   2019-11-06 00:02:05 base/16385/16492 invalid_8k_block 250 segment 0 offset 250
   2019-10-15 19:53:37 base/16385/16493 invalid_8k_block 0 segment 0 offset 0
   2019-11-05 23:59:14 base/16385/16502 invalid_8k_block 262644 segment 2 offset 500
   2019-10-16 22:26:30 base/16385/16502 invalid_8k_block 394216 segment 3 offset 1000 

So we know that there are at least 4 corrupt blocks. Let’s scan the whole data directory using Credativ’s pg_checksums (without shutting down the database) to see if there are any more blocks with bad checksums:

$ pg_checksums -D /var/lib/pgsql/11.5/data |& fold -s
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16502.3", block 1000: calculated checksum
 2ED4 but block contains 4EDF
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16502.3", block 1010: calculated checksum
 9ECF but block contains ACBE
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16502.2", block 500: calculated checksum
 5D6 but block contains E459
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16493", block 0: calculated checksum E7E4
 but block contains 78F9
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16492", block 250: calculated checksum
 44BA but block contains 3ABA
 Checksum operation completed
 Files scanned:  1551
 Blocks scanned: 624158
 Bad checksums:  5
 Data checksum version: 1

Ah-ha… there was one more bad checksum which didn’t show up in the logs! Next let’s choose one of the bad blocks and dump the contents using unix command line tools.

$ dd status=none if=base/16385/16492 bs=8192 count=1 skip=250 | od -A d -t x1z -w16
0000000 00 00 00 00 e0 df 6b b0 ba 3a 04 00 0c 01 80 01  >......k..:......<
0000016 00 20 04 20 00 00 00 00 80 9f f2 00 00 9f f2 00  >. . ............<
0000032 80 9e f2 00 00 9e f2 00 80 9d f2 00 00 9d f2 00  >................<
0000048 80 9c f2 00 00 9c f2 00 80 9b f2 00 00 9b f2 00  >................<
0000064 80 9a f2 00 00 9a f2 00 80 99 f2 00 00 99 f2 00  >................<
0000080 80 98 f2 00 00 98 f2 00 80 97 f2 00 00 97 f2 00  >................<

Here we see the page header and the beginning of the line pointers. One thing I think it’s easy to remember is that the first 8 bytes are the page LSN and the next 2 bytes are the page checksum. Notice that the page checksum bytes contain “ba 3a” which matches the error message from the scan above (3ABA). Sometimes it can be useful to know just the very top of the page even if you don’t remember anything else!

This is useful, but let’s try the pg_filedump utility next. This utility takes a lot of options. In this example I’m going to ask it to verify the checksum (-k), only scan one block at offset 250 (-R 250 250) and even to decode the tuples (table row data) to a human-readable format (-D int,int,int,charN). There’s another argument (-f) that can even tell pg_filedump to show hexdump/od style raw data inline but I won’t demonstrate that here.

$ pg_filedump -k -R 250 250 -D int,int,int,charN base/16385/16492

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 11.0
*
* File: base/16385/16492
* Options used: -k -R 250 250 -D int,int,int,charN
*
* Dump created on: Fri Nov  8 21:48:38 2019
*******************************************************************

Block  250 ********************************************************
<Header> -----
 Block Offset: 0x001f4000         Offsets: Lower     268 (0x010c)
 Block: Size 8192  Version    4            Upper     384 (0x0180)
 LSN:  logid      0 recoff 0xb06bdfe0      Special  8192 (0x2000)
 Items:   61                      Free Space:  116
 Checksum: 0x3aba  Prune XID: 0x00000000  Flags: 0x0004 (ALL_VISIBLE)
 Length (including item array): 268

 Error: checksum failure: calculated 0x44ba.

<Data> ------
 Item   1 -- Length:  121  Offset: 8064 (0x1f80)  Flags: NORMAL
COPY: 15251	1	0
 Item   2 -- Length:  121  Offset: 7936 (0x1f00)  Flags: NORMAL
COPY: 15252	1	0
 Item   3 -- Length:  121  Offset: 7808 (0x1e80)  Flags: NORMAL
COPY: 15253	1	0

That was the block header and the first few items. (Item pointer data is displayed first, then the table row data itself is displayed on the following line after the word COPY.) Looking down a little bit, we can even see where I wrote the bytes “badstuff” into a random location in this block – it turns out those bytes landed in the middle of a character field. This means that without checksums, PostgreSQL would not have thrown any errors at all but just returned an incorrect string the next time that row was queried!

COPY: 15297	1	0
 Item  48 -- Length:  121  Offset: 2048 (0x0800)  Flags: NORMAL
COPY: 15298	1	0
 Item  49 -- Length:  121  Offset: 1920 (0x0780)  Flags: NORMAL
COPY: 15299	1	0	                                           badstuff
 Item  50 -- Length:  121  Offset: 1792 (0x0700)  Flags: NORMAL
COPY: 15300	1	0

It’s immediately clear how useful this is (and easier to read). The part where it decodes the table row data into a human readable form is an especially cool trick. Two notes about this.

  • First, the lines are prefixed with the word COPY for a reason – they are actually intended to be formatted so you can grep on the word COPY and then use the “copy” command (or its psql cousin) to feed the data directly back into a staging table in the database for cleanup – a sketch of this round-trip appears after the help output below. How cool is that!
  • Second, it can decode only a set of fairly standard data types and relation types.
$ pg_filedump -h

Version 11.0 (for PostgreSQL 11.x)
Copyright (c) 2002-2010 Red Hat, Inc.
Copyright (c) 2011-2018, PostgreSQL Global Development Group

Usage: pg_filedump [-abcdfhikxy] [-R startblock [endblock]] 
          [-D attrlist] [-S blocksize] [-s segsize] [-n segnumber] file

Display formatted contents of a PostgreSQL heap/index/control file
Defaults are: relative addressing, range of the entire file, block
               size as listed on block 0 in the file

The following options are valid for heap and index files:
...
...
...
  -D  Decode tuples using given comma separated list of types
      Supported types:
        bigint bigserial bool char charN date float float4 float8 int
        json macaddr name oid real serial smallint smallserial text
        time timestamp timetz uuid varchar varcharN xid xml
      ~ ignores all attributes left in a tuple

Now you see what happens when I start having fun… a quick update about our SEAPUG meetup this past Tuesday turned into a blog article that’s way too long. :) Hope it’s useful, and as always let me know what I can improve!

        json macaddr name oid real serial smallint smallserial text
        time timestamp timetz uuid varchar varcharN xid xml
      ~ ignores all attributes left in a tuple

Now you see what happens when I start having fun… a quick update about our SEAPUG meetup this past Tuesday turned into a blog article that’s way too long. :) Hope it’s useful, and as always let me know what I can improve!

PostgreSQL Invalid Page and Checksum Verification Failed

At the Seattle PostgreSQL User Group meetup this past Tuesday, we got onto the topic of invalid pages in PostgreSQL. It was a fun discussion and it made me realize that it’d be worth writing down a bunch of the stuff we talked about – it might be interesting to a few more people too!

Invalid Page In Block

You see an error message that looks like this:

ERROR: invalid page in block 1226710 of relation base/16750/27244

First and foremost – what does this error mean? I like to think of PostgreSQL as having a fairly strong “boundary” between (1) the database itself and (2) the operating system [and by extension everything else… firmware, disks, network, remote storage, etc]. PostgreSQL maintains page validity primarily on the way in and out of its buffer cache.

[Figure: pg-page-validity diagram – page validity is checked on the way into and out of the PostgreSQL buffer cache]

What does this mean in practice? Suppose there’s a physical memory failure and somehow the ECC parity is unable to detect it. This means that a little bit of physical memory on the server now has incorrect garbage and the correct data from that memory is lost.

  • If the garbage bytes map to part of the kernel page cache, then when PostgreSQL tries to copy the page into its buffer cache it will (if possible) detect that something is wrong, refuse to poison its buffer cache with this invalid 8k page, and error out any queries that require this page for processing with the ERROR message above.
  • If the garbage bytes map to part of PostgreSQL’s database buffer cache, then PostgreSQL will quite happily assume nothing is wrong and attempt to process the data on the page. Results are unpredictable; probably all sorts of ERROR messages, crashes and failure modes could result – or maybe even incorrect data returned with no ERROR message at all. (Note that this is probably the case for nearly all software… and also note that ECC is pretty good.)

How PostgreSQL Checks Page Validity

PostgreSQL has two main “validity checks” that it performs on pages. You can read the code in the function PageIsVerified() but I’ll summarize here. You can tell from your error message which validity check failed. It depends on whether you see a second additional WARNING right before the ERROR. The warning would look like this:

WARNING: page verification failed, calculated checksum 3482 but expected 32232
  1. If the above warning is not present, this means the page header failed a basic sanity check. This could conceivably be caused by both problems inside and outside of PostgreSQL.
  2. If you see the above warning (page verification failed), this means the checksum recorded in the block did not match the checksum calculated for the block. This most likely indicates that there was a problem outside of (underneath) the database – operating system, memory, networking, storage, etc.

About Basic Page Header Sanity Checks

As of when I’m writing this article in 2019, the following basic sanity checks are performed on the page header:

  • There are 16 bits reserved for page flag bits; at present only three are used and the other 13 bits should always be zero/off.
  • Every page is divided into four parts (header, free space, tuples, special space).  Offsets for the divisions are stored as 16-bit numbers in the page header; the offsets should go in order and should not have a value pointing off the page.
  • The offset of the special space should always be aligned.

About PostgreSQL Checksums

PostgreSQL version 9.3 (released in 2013) added the ability to calculate a checksum on data pages and store the checksum in the page. There are two inputs to the checksum: (1) every single byte of the data page, with zeros in the two bytes where the checksum itself will be stored later and (2) the page offset/address. This means that PostgreSQL doesn’t just detect if a byte is changed in the page – it also detects if a perfectly valid page gets somehow dropped into the wrong place.

Checksums are not maintained for blocks while they are in the shared buffers – so if you look at a buffer in the PostgreSQL page cache with pageinspect and you see a checksum value, note that it’s probably just leftover from the last read but wouldn’t have been maintained when the page was changed. The checksum is calculated and stamped onto the page when the page is written out of the buffer cache into the operating system page cache (remember the pages get flushed to disk later by a separate fsync call).
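
As a side note, one way to peek at the checksum currently stamped on a page is the pageinspect contrib extension. A minimal sketch (the table name is made up):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- page_header() decodes the page header of the raw page image; the checksum
-- column is whatever was stamped at the page's last write, so for a dirty
-- buffer it may not match a freshly calculated checksum.
SELECT lsn, checksum, flags
FROM   page_header(get_raw_page('my_table', 0));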

The checksum algorithm is specifically designed to take advantage of SIMD instructions. The slowest part of the algorithm is multiplication, so if possible PostgreSQL will be compiled to perform 32 multiplications at a time in parallel. In 2013 there were two platforms specifically documented to support this: x86 SSE4.1 and ARM NEON. The implementation is specifically tuned for optimal performance on x86 SSE. This is an important point actually – particularly for folks running PostgreSQL on embedded devices – PostgreSQL certainly compiles and works well on a lot of platforms, but evidently the checksum implementation is explicitly optimized to run the fastest on Intel. (To be clear… I think everyone should use checksums on every platform unless you have a really really good reason not to – just don’t be surprised if you start comparing benchmarks with Intel and you see a difference!)

For folks interested in digging a little more into the history… here’s the original commit using a CRC algorithm which never shipped in an actual PostgreSQL release (Simon Riggs, Jeff Davis and Greg Smith), here’s the subsequent commit introducing the FNV-1a algorithm instead of CRC which is what we still use today (Ants Aasma), and here’s the commit doing the major refactor which moved the algorithm into a header file for easier external use (Tom Lane).

More Ways To Check Validity

At the SEAPUG meetup this led to a discussion about checking validity. Do checksums cover everything? (hint: no) Are there more ways we can validate our databases? (hint: yes)

I haven’t yet made a full list of which files are covered by checksums and which ones aren’t, but I know that not everything is. For example: I’m pretty sure that the visibility map and SLRU files aren’t covered with checksums. But for what it’s worth, there are two extra tools we can use for verification in PostgreSQL databases:

  • The amcheck extension can scan a B-Tree index for a number of logical problems – for example, verifying that all B-Tree pages have items in “logical” order. (This could be useful, for example, if you’re not specifying ICU collation and you recently upgraded your operating system collation libraries… since PostgreSQL uses OS collation by default.)
  • The pg_visibility extension includes two functions to check for corruption in the visibility map – pg_check_frozen() and pg_check_visible(). (There’s a short sketch of both tools just after this list.)
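
A minimal sketch of both checks (the object names are made up; both are contrib extensions that ship with PostgreSQL):

CREATE EXTENSION IF NOT EXISTS amcheck;
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Logical checks on a B-Tree index; an ERROR is raised if a problem is found.
SELECT bt_index_check('my_table_pkey');

-- Return the TIDs of any corrupted visibility map entries (no rows = clean).
SELECT * FROM pg_check_visible('my_table');
SELECT * FROM pg_check_frozen('my_table');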

Responding To Checksum Failures

Finally, what if you actually run into a checksum failure? What should you do, and are there any additional tools you should know about?

First of all – on any database – there are a few things you should always do immediately when you see something indicating that a data corruption may have occurred:

  • Verify that your backup retention and your log file retention are sufficiently long – I recommend at least a month (this is a Happiness Hint). You never know how long the investigation might take, or how long ago something important might have happened.
  • Take a moment to articulate and write out the impact to the business. (Are important queries failing right now? Is this causing an application outage?) This seems small but it can be very useful in making decisions later. Don’t exaggerate the impact but don’t understate it either. It can also be helpful to note important timelines that you already know about. For example: management is willing to use yesterday’s backup and take a day of data loss to avoid a 12 hour outage, but not to avoid a 4 hour outage …or… management needs a status update at 11:00am Eastern Time.
  • If there’s a larger team working on the system in question, communicate a freeze on changes until things are resolved.
  • Make a list or inventory of all copies of the data. Backups, physical replicas or hot standbys, logical replicas, copies, etc. Sometimes the very process of making this list can immediately give you ideas for simple solutions (like checking if a hot standby has the block intact). The next thing you probably want to do is check all items in this list to see if they have a valid copy of the data in question. Do not take any actions to remediate the problem right away; collect all of the information first. The data you collect now might be useful during RCA investigation after you’re back online.
  • If there was one checksum failure, then you probably want to make sure there aren’t more.
    • If it’s a small database, consider whether you can scan the whole thing and verify the checksum on every single block.
    • If it’s a large database, consider whether you can at least scan all of the system/catalog tables and perhaps scan the tables which are throwing errors in their entirety. (PostgreSQL stops on the first error, so there isn’t an easy way to check if other blocks in the same table also have checksum problems.)
  • A few general best practices… (1) have a second person glance at your screen before you execute any actual changes, (2) never delete anything but always rename/move instead, (3) when investigating individual blocks, also look at the block before and the block after to verify they look like what you’d normally expect, (4) test the remediation plan before running it in production, and (5) document everything. If you’ve never seen Paul Vallée’s FIT-ACER checklist then it’s worth reviewing.

There’s no single prescriptive process for diagnosing the scope of the problem and finding the right path forward for you. It involves learning what you need to know about PostgreSQL, a little creative thinking about possible resolutions, and balancing the needs of the business (for example, how long can you be down and how much data loss can you tolerate).

That being said, there are a few tools I know about which can be very useful in the process. (And there are probably more that I don’t know about; please let me know and I can add them to this list!)


Data investigation:

Unix/Linux Commands
You might be surprised at what you can do with the tools already installed on your operating system. I’ve never been on a unix system that didn’t have dd and od installed, and I find that many Linux systems have hexdump and md5sum installed as well. A few examples of how useful these tools are: dd can extract the individual block with invalid data on the primary server and extract the same block on the hot standby, then od/hexdump can be used to create a human-readable textual dump of the binary data. You can even use diff to find the differences between the blocks. If you have a standby cluster with storage-level replication then you could use md5sum to see at a glance if the blocks match. (Quick word of caution on comparing hot standbys: last I checked, PostgreSQL doesn’t seem to maintain the free space identically on hot standbys, so the checksums might differ on perfectly healthy blocks. You can still look at the diff and verify whether free space is the only difference.) Drawbacks: low-level utilities can only do binary dumps but cannot interpret the data. Also, utilities like dd are “sharp knives” – powerful tools which can cause damage if misused!

For a great example of using dd and od, read the code in Bertrand Drouvot‘s pg_toolkit script collection.
Data investigation and checksum verification:

pg_filedump
This is a crazy awesome utility and I have no idea why it’s not in core PostgreSQL. It makes an easy-to-read textual dump of the binary contents of PostgreSQL data blocks. You can process a whole file or just specify a range of blocks to dump. It can verify checksums and it can even decode the contents of the tuples. As far as I can tell, it was originally written by Patrick Macdonald at Red Hat some time in the 2000’s and then turned over to the PostgreSQL Global Development Group for stewardship around 2011. It’s now in a dedicated repository at git.postgresql.org and it seems that Tom Lane, Christoph Berg and Teodor Sigaev keep it alive but don’t invest heavily in it. Drawbacks: be aware that it doesn’t address the race condition with a running server (see Credativ pg_checksums below). For dumping only a block with a checksum problem, this is not an issue since the server won’t let the block into its buffer cache anyway.
Checksum verification:

PostgreSQL pg_checksums
PostgreSQL itself starting in version 11 has a command-line utility to scan one relation or everything and verify the checksum on every single block. It’s called pg_verify_checksums in v11 and pg_checksums in v12. Drawbacks: first, this utility requires you to shut down the database before it will run. It will throw an error and refuse to run if the database is up. Second, you can scan a single relation but you can’t say which database it’s in… so if the OID exists in multiple databases, there’s no way to just scan the one you care about.
Checksum verification:

Credativ pg_checksums
The fine engineers of Credativ have published an enhanced version of pg_checksums which can verify checksums on a running database. It looks to me like the main case they needed to protect against was the race condition between pg_checksums reading blocks while the running PostgreSQL server was writing those same blocks. Linux of course works with a 4k page size, so if an 8k database page is half written when pg_checksums reads it then we will get a false positive. The version from Credativ, however, is smart enough to deal with this. Drawbacks: check the github issues; there are a couple of notable drawbacks but this project was only announced last week and all the drawbacks might be addressed by the time you read this article. Also, being based on the utility in PostgreSQL, the same limitation about scanning a single relation applies.

Note that both Credativ’s and PostgreSQL’s pg_checksums utilities access the control file, even when just verifying checksums. As a result, you need to make sure you compile against the same version of PostgreSQL code as the target database you’re scanning.
Checksum verification:

Satoshi Nagayasu postgres-toolkit
I’m not sure if this is still being maintained, but Satoshi Nagayasu wrote postgres-toolkit quite a long time ago which includes a checksum verification utility. It’s the oldest one I have seen so far – and it still compiles and works! (Though if you want to compile it on PostgreSQL 11 or newer then you need to use the patch in this pull request.) Satoshi’s utility also has the very useful capability of scanning an arbitrary file that you pass in – like pg_filedump but stripped down to just do the checksum verification. It’s clever enough to infer the segment number from the filename and scan the file, even if the file isn’t part of a PostgreSQL installation. This would be useful, for example, if you were on a backup server and wanted to extract a single file from your backup and check if the damaged block has valid checksum in the backup. Drawbacks: be aware that it doesn’t address the race condition with a running server.
Checksum verification:

Google pg_page_verification
Simple program; you pass in a data directory and it will scan every file in the data directory to verify the checksums on all blocks. Published to Github in early 2018. Drawbacks: be aware that it doesn’t address the race condition with a running server. Probably superseded by the built-in PostgreSQL utilities.
Mitigation:

PostgreSQL Developer Options
PostgreSQL has hundreds of “parameters” – knobs and buttons you can use to configure how it runs. There are 294 entries in the pg_settings table on version 11. Buried in these parameters are a handful of “Developer Options” providing powerful (and dangerous) tools for mitigating data problems – such as ignore_checksum_failure, zero_damaged_pages and ignore_system_indexes. Read very carefully and exercise great care with these options – when not fully understood, they can have unexpected side effects including unintended data loss. Exercise particular care with the ignore_checksum_failure option – even if you set that in an individual session, the page will be readable to all connections… think of it as poisoning the buffer cache. That being said, sometimes an option like zero_damaged_pages is the fastest way to get back up and running. (Just make sure you’ve saved a copy of that block!) By the way… a trick to trigger a read of one specific block is to SELECT * FROM table WHERE ctid='(blockno,1)'.
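
To make that concrete, here’s a minimal sketch (the table name is made up) of the ctid trick and the session-level developer options – an illustration, not a recommendation:

-- Force a read of one specific block (block 250 here) via a TID scan.
SELECT * FROM my_table WHERE ctid = '(250,1)';

-- Superuser-only developer options; read the documentation carefully first.
SET ignore_checksum_failure = on;  -- the bad page still ends up in shared buffers
SET zero_damaged_pages = on;       -- damaged pages are zeroed on read; that data is gone
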
Mitigation:

Unix/Linux Commands
I would discourage the use of dd to mitigate invalid data problems. It’s dangerous even for experienced engineers; simple mistakes can compound the problem. I can’t imagine a situation where this is a better approach than the zero_damaged_pages developer option and a query to read a specific block. That said, I have seen cases where dd was used to zero out a page.

More About Data Investigation

In order to put some of this together, I’ll just do a quick example session. I’m running PostgreSQL 11.5 on an EC2 instance and I used dd to write a few evil bytes into a couple of blocks of my database.

First, let’s start by just capturing the information from the log files:

$ grep "invalid page" ../log/postgresql.log|sed 's/UTC.*ERROR//'
 2019-10-15 19:53:37 :  invalid page in block 0 of relation base/16385/16493
 2019-10-16 22:26:30 :  invalid page in block 394216 of relation base/16385/16502
 2019-10-16 22:43:24 :  invalid page in block 394216 of relation base/16385/16502
 2019-11-05 23:55:33 :  invalid page in block 394216 of relation base/16385/16502
 2019-11-05 23:57:58 :  invalid page in block 394216 of relation base/16385/16502
 2019-11-05 23:59:14 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-05 23:59:21 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-05 23:59:22 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-05 23:59:23 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-06 00:01:12 :  invalid page in block 262644 of relation base/16385/16502
 2019-11-06 00:01:16 :  invalid page in block 0 of relation base/16385/16493
 2019-11-06 00:02:05 :  invalid page in block 250 of relation base/16385/16492

With a little command-line karate we can list each distinct block and see the first time we got an error on that block:

$ grep "invalid page" ../log/postgresql.log |
sed 's/UTC.*ERROR//' |
awk '{print $1" "$2" "$11" invalid_8k_block "$8" segment "int($8/131072)" offset "($8%131072)}' |
sort -k3,5 -k1,2 |
uniq -f2
   2019-11-06 00:02:05 base/16385/16492 invalid_8k_block 250 segment 0 offset 250
   2019-10-15 19:53:37 base/16385/16493 invalid_8k_block 0 segment 0 offset 0
   2019-11-05 23:59:14 base/16385/16502 invalid_8k_block 262644 segment 2 offset 500
   2019-10-16 22:26:30 base/16385/16502 invalid_8k_block 394216 segment 3 offset 1000 

So we know that there are at least 4 corrupt blocks. Let’s scan the whole data directory using Credativ’s pg_checksums (without shutting down the database) to see if there are any more blocks with bad checksums:

$ pg_checksums -D /var/lib/pgsql/11.5/data |& fold -s
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16502.3", block 1000: calculated checksum
 2ED4 but block contains 4EDF
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16502.3", block 1010: calculated checksum
 9ECF but block contains ACBE
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16502.2", block 500: calculated checksum
 5D6 but block contains E459
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16493", block 0: calculated checksum E7E4
 but block contains 78F9
 pg_checksums: error: checksum verification failed in file
 "/var/lib/pgsql/11.5/data/base/16385/16492", block 250: calculated checksum
 44BA but block contains 3ABA
 Checksum operation completed
 Files scanned:  1551
 Blocks scanned: 624158
 Bad checksums:  5
 Data checksum version: 1

Ah-ha… there was one more bad checksum which didn’t show up in the logs! Next let’s choose one of the bad blocks and dump the contents using unix command line tools.

$ dd status=none if=base/16385/16492 bs=8192 count=1 skip=250 | od -A d -t x1z -w16
0000000 00 00 00 00 e0 df 6b b0 ba 3a 04 00 0c 01 80 01  >......k..:......<
0000016 00 20 04 20 00 00 00 00 80 9f f2 00 00 9f f2 00  >. . ............<
0000032 80 9e f2 00 00 9e f2 00 80 9d f2 00 00 9d f2 00  >................<
0000048 80 9c f2 00 00 9c f2 00 80 9b f2 00 00 9b f2 00  >................<
0000064 80 9a f2 00 00 9a f2 00 80 99 f2 00 00 99 f2 00  >................<
0000080 80 98 f2 00 00 98 f2 00 80 97 f2 00 00 97 f2 00  >................<

Here we see the page header and the beginning of the line pointers. One thing I think is easy to remember is that the first 8 bytes are the page LSN and the next 2 bytes are the page checksum. Notice that the page checksum bytes contain “ba 3a” which matches the error message from the scan above (3ABA). Sometimes it can be useful to know just the very top of the page even if you don’t remember anything else!

This is useful, but let’s try the pg_filedump utility next. This utility takes a lot of options. In this example I’m going to ask it to verify the checksum (-k), only scan one block at offset 250 (-R 250 250) and even to decode the tuples (table row data) to a human-readable format (-D int,int,int,charN). There’s another argument (-f) that can even tell pg_filedump to show hexdump/od style raw data inline but I won’t demonstrate that here.

$ pg_filedump -k -R 250 250 -D int,int,int,charN base/16385/16492

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 11.0
*
* File: base/16385/16492
* Options used: -k -R 250 250 -D int,int,int,charN
*
* Dump created on: Fri Nov  8 21:48:38 2019
*******************************************************************

Block  250 ********************************************************
-----
 Block Offset: 0x001f4000         Offsets: Lower     268 (0x010c)
 Block: Size 8192  Version    4            Upper     384 (0x0180)
 LSN:  logid      0 recoff 0xb06bdfe0      Special  8192 (0x2000)
 Items:   61                      Free Space:  116
 Checksum: 0x3aba  Prune XID: 0x00000000  Flags: 0x0004 (ALL_VISIBLE)
 Length (including item array): 268

 Error: checksum failure: calculated 0x44ba.

------
 Item   1 -- Length:  121  Offset: 8064 (0x1f80)  Flags: NORMAL
COPY: 15251	1	0
 Item   2 -- Length:  121  Offset: 7936 (0x1f00)  Flags: NORMAL
COPY: 15252	1	0
 Item   3 -- Length:  121  Offset: 7808 (0x1e80)  Flags: NORMAL
COPY: 15253	1	0

That was the block header and the first few items. (Item pointer data is displayed first, then the table row data itself is displayed on the following line after the word COPY.) Looking down a little bit, we can even see where I wrote the bytes “badstuff” into a random location in this block – it turns out those bytes landed in the middle of a character field. This means that without checksums, PostgreSQL would not have thrown any errors at all but just returned an incorrect string the next time that row was queried!

COPY: 15297	1	0
 Item  48 -- Length:  121  Offset: 2048 (0x0800)  Flags: NORMAL
COPY: 15298	1	0
 Item  49 -- Length:  121  Offset: 1920 (0x0780)  Flags: NORMAL
COPY: 15299	1	0	                                           badstuff
 Item  50 -- Length:  121  Offset: 1792 (0x0700)  Flags: NORMAL
COPY: 15300	1	0

It’s immediately clear how useful this is (and easier to read). The part where it decodes the table row data into a human readable form is an especially cool trick. Two notes about this.

  • First, the lines are prefixed with the word COPY for a reason – they are actually intended to be formatted so you can grep on the word COPY and then use the “copy” command (or its psql cousin) to feed the data directly back into a staging table in the database for cleanup. How cool is that! (There’s a short sketch of this after the help output below.)
  • Second, it can decode only a set of fairly standard data types and relation types.
$ pg_filedump -h

Version 11.0 (for PostgreSQL 11.x)
Copyright (c) 2002-2010 Red Hat, Inc.
Copyright (c) 2011-2018, PostgreSQL Global Development Group

Usage: pg_filedump [-abcdfhikxy] [-R startblock [endblock]] 
          [-D attrlist] [-S blocksize] [-s segsize] [-n segnumber] file

Display formatted contents of a PostgreSQL heap/index/control file
Defaults are: relative addressing, range of the entire file, block
               size as listed on block 0 in the file

The following options are valid for heap and index files:
...
...
...
  -D  Decode tuples using given comma separated list of types
      Supported types:
        bigint bigserial bool char charN date float float4 float8 int
        json macaddr name oid real serial smallint smallserial text
        time timestamp timetz uuid varchar varcharN xid xml
      ~ ignores all attributes left in a tuple
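
For example, a hedged sketch of the round trip mentioned above (the file and table names are made up): extract the recovered rows with something like grep '^COPY:' block250_dump.txt | sed 's/^COPY: //' > /tmp/recovered.tsv, then load them into a staging table:

-- The recovered rows are tab-separated, so the default text format works.
CREATE TABLE recovered_rows (LIKE my_damaged_table);

COPY recovered_rows FROM '/tmp/recovered.tsv' WITH (FORMAT text);
-- (Server-side COPY FROM needs superuser or pg_read_server_files;
--  psql's \copy equivalent reads the file from the client instead.)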

Now you see what happens when I start having fun… a quick update about our SEAPUG meetup this past Tuesday turned into a blog article that’s way too long. :) Hope it’s useful, and as always let me know what I can improve!

PASS Summit 2nd Day Keynote

Woohoo, we’ve made it to the second day of PASS Summit!  After an awesome first day running from the keynote, to my first session with Denise McInerney, to arriving late for my executive meeting with less than a dozen other attendees and Rohan Kumar, and then on to sessions/networking/parties, here we are for the second day of PASS Summit keynote!

Info and Awards

We started the day with Wendy Pastrick to tell us about the financial status of PASS and she decided to make sure we were awake and serenade us with a lovely song first!

Tim Ford was next to talk about what the community and PASS means to all of us.  He announced the PASSion award, which was given to my wonderful friend and co-presenter in webinars and fellow Linux Learning Pathway sessions, Hamish Watson!  Congratulations, Hamish!!

2nd Day Keynote- TARAH WHEELER!!

Tarah Wheeler is a cybersecurity expert and has written a book on “Women in Tech, Take Your Career to the Next Level“.  She came to speak to us about the daily challenges of data breaches in companies and why we are an important part of working to solve it.  She wants DBAs and others in our industry to be diplomats.  We should recognize the vulnerabilities and work with our companies to make them more cyber-secure.

She discussed the isolation and separation of the internet into three functional internets due to GDPR and other data privacy and government policies in the world today.

She discussed how data is simply collected in countries like China vs. the rest of the world, along with GDPR and how most data retention isn’t designed for GDPR policies.  Data retention, backup and collection all have implications depending on these.  How we have to think about data has changed drastically in recent years.  Some of this has come from capabilities in technology, some has come from regulations/policies and others through breaches.

AWESOMESAUCE

I really, really enjoyed this keynote-  as much as I enjoyed hearing about the future of our own platform of products from Microsoft after the MSIgnite announcements, Tarah’s presentation was timely and dramatically important to us as data professionals.

The redundancy we put into our systems to protect our data also makes us vulnerable, and this talk resonated with me deeply.  There’s a constant conflict between retaining the data and removing it when policies/regulations/individuals’ rights state we must.  It’s a constant push and pull for data professionals today.

She discussed the importance of future investment in data retention policies, cybersecurity and what I know many of us knew was coming.  We have a tendency to procrastinate.  We think it’s always tomorrow that we need to worry about it because there is so much else that we need to worry about, but critical data and policies around GDPR and cybersecurity are an essential topic TODAY.

I did a lot of GDPR talks, including an intro talk last year at PASS Summit.  Her use cases on China regulations, GDPR and California combinations, etc. are similar to ones I created in my own examples.  I was able to quickly demonstrate to any attendee of the talk who told me they weren’t susceptible to GDPR regulations how they were.  The ease with which I was able to make them susceptible was quite mind-blowing, so I was thrilled she was able to demonstrate this to the entire attendance at Summit.  I’m all @TeamTarah now.

Thank you to PASS for having Tarah as the second day keynote – it was a great keynote today and, after Rohan’s yesterday, I’m looking forward to tomorrow’s even more!

 

 




Oracle database and grid home patches to install

This blogpost is about Oracle database and grid infrastructure software homes, which patches should be applied to which homes, and what it then looks like. This is fully documented by MyOracleSupport notes, but you will see that with version 18 and up this is unclear.

I keep a script-set that automatically installs and patches the Oracle database software and creates a database. This script-set is called vagrant-builder, and it can install any version with any PSU applied between 11.2.0.2 up to 19.5, which is the latest PSU of the latest version, with a few exceptions: for 11.2.0.2 and 11.2.0.3 I only created an install for the base version and the latest PSU for the database, and version 12.1.0.1 is left out entirely.

I recently reviewed my installs and verified everything is carried out correctly. First a simple overview of what I think should be applied on the database and grid infrastructure install:

Version  Grid               Database
-------- ------             ----------
11.2.0.2 -                  DB PSU
11.2.0.3 -                  DB PSU+OJVM
11.2.0.4 GI PSU+JDBC patch  DB PSU+OJVM
12.1.0.2 GI PSU+JDBC patch  DB PSU+OJVM
12.2.0.1 GI PSU             DB PSU+OJVM
18       GI PSU             DB PSU+OJVM
19       GI PSU             DB PSU+OJVM

('-' means not investigated)
My idea of what should be applied is based on MOS note 1929745.1: Oracle recommended patches.

Grid patches, JDBC patch
The JDBC patch for grid infrastructure 11.2.0.4 and 12.1.0.2 is a patch that updates java classes. Therefore the patch is a generic one; the java classes do not contain operating system dependent machine code. The table in the MOS note also shows differences for the JDBC patch depending on the PSU date, with January 2014 and July 2016 as the boundaries. No surprise there. It’s especially easy because all versions in premier support as of the date of this blogpost do not need the JDBC patch.

Grid patches, GI PSU
For the GI PSU, there are a lot of patches that contain the GI PSU, because outside of the GI patch itself, there are also combination patches that for example contain both the GI and the DB PSU. I like to keep it as simple as I can. Therefore, I stick to MOS note 2118136.2: Assistant: Download Reference for Oracle Database/GI Update, Revision, PSU, SPU(CPU), bundle patches, Patchsets and Base releases, and look at the following:
– versions 11.2.0.4/12.1.0.2: Oracle Database PSU, SPU(CPU), Bundle Patches (Versions 12.1 & lower), version (11.2.0.4/12.1.0.2), GI PSU column.
– versions 12.2.0.1 and up: Oracle Database Updates, version (12.2.0.1,18.0.0.0,19.0.0.0), GI Update column.

Database patches, DB PSU
Here too there are multiple patches that can be used to apply the DB PSU, and I stick with MOS note 2118136.2: Assistant: Download Reference for Oracle Database/GI Update, Revision, PSU, SPU(CPU), bundle patches, Patchsets and Base releases, and look at the following:
– versions 11.2.0.4/12.1.0.2: Oracle Database PSU, SPU(CPU), Bundle Patches (Versions 12.1 & lower), version (11.2.0.4/12.1.0.2), PSU column.
– versions 12.2.0.1 and up: Oracle Database Updates, version (12.2.0.1,18.0.0.0,19.0.0.0), DB Update column.

Database patches, OJVM
There are multiple MOS documents talking about the database JavaVM patch, and there are multiple patches, but here I stick with MOS note 2118136.2: Assistant: Download Reference for Oracle Database/GI Update, Revision, PSU, SPU(CPU), bundle patches, Patchsets and Base releases once again, and look at the following:
– OJVM Update/PSU/Bundle Patches, 11.2.0.4/12.1.0.2/12.2.0.1/18.0.0.0/19.0.0.0, OJVM Update.

Okay. So we got the table above that is based on MOS note 1929745.1, and we got all the patches organised in MOS note 2118136.2. So that’s nice and simple, right?

Well, not entirely…

As I said, I was checking up on the latest PSU installs. This is the ‘opatch lspatches’ overview of 12.1.0.2 and 12.2.0.1:

-- 12.1.0.2
GI 190716
---------
23727148;
29509318;OCW PATCH SET UPDATE 12.1.0.2.190716 (29509318)
29494060;Database Patch Set Update : 12.1.0.2.190716 (29494060)
29423125;ACFS PATCH SET UPDATE 12.1.0.2.190716 (29423125)
26983807;WLM Patch Set Update: 12.1.0.2.180116 (26983807)
DB 190716
---------
29774383;Database PSU 12.1.0.2.190716, Oracle JavaVM Component (JUL2019)
29494060;Database Patch Set Update : 12.1.0.2.190716 (29494060)
-- 12.2.0.1
GI 191015
---------
30138470;Database Oct 2019 Release Update : 12.2.0.1.191015 (30138470)
30122828;ACFS OCT 2019 RELEASE UPDATE 12.2.0.1.0 (30122828)
30122814;OCW OCT 2019 RELEASE UPDATE 12.2.0.1.191015 (30122814)
30093408;TOMCAT RELEASE UPDATE 12.2.0.1.0(ID:190208.0920) (30093408)
26839277;DBWLM RELEASE UPDATE 12.2.0.1.0(ID:170913) (26839277)
DB 191015
---------
30133625;OJVM RELEASE UPDATE: 12.2.0.1.191015 (30133625)
30138470;Database Oct 2019 Release Update : 12.2.0.1.191015 (30138470)

For the database (DB), we see two patches, which is exactly what is expected.
– The database patch, which is called release update with one and patch set update with the other.
– The OJVM update, which is also named differently between the two versions, but very much recognisable as such.
I think it’s bad that the names vary, but this is totally expected.
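
For what it’s worth, a quick way to cross-check what has actually been applied inside the database itself (by datapatch), as opposed to what opatch reports for the home, is the dba_registry_sqlpatch view. A minimal sketch; the exact columns differ a little between versions:

select patch_id, action, status, description
from   dba_registry_sqlpatch
order by action_time;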

For the grid infrastructure (GI), we see 5 patches in both situations, but these are not the same patches between the homes!
– The unnamed patch in the 12.1.0.2 home is the JDBC patch, which should only be applied to 12.1.0.2, not to higher versions.
– The OCW patch (oracle clusterware).
– The database patch.
– The ACFS patch. The version indication in the patch name changed.
– The WLM/DBWLM patch. Sadly the name changed, and the name with 12.2.0.1 is actually quite useless: I can’t tell the actual version, so I have to look up the patch number.
– Starting from version 12.2.0.1, there is a tomcat installation in the grid home, as this patch indicates. The name here is not helpful because it doesn’t indicate the actual version, like with the DBWLM patch.

So, apart from what is in my opinion bad naming, and a weird inclusion of a competitor to an Oracle product (Tomcat versus Weblogic), this still follows the rules of logic.

Now let’s look at the same output for version 18 and 19:

--18
GI 18.8
30116128;ACFS RELEASE UPDATE 18.8.0.0.0 (30116128)
30113775;OCW RELEASE UPDATE 18.8.0.0.0 (30113775)
30112122;Database Release Update : 18.8.0.0.191015 (30112122)
30093398;TOMCAT RELEASE UPDATE 18.0.0.0.0 (30093398)
28655963;DBWLM RELEASE UPDATE 18.4.0.0.0 (28655963)
27923415;OJVM RELEASE UPDATE: 18.3.0.0.180717 (27923415)
DB 18.8
30133603;OJVM RELEASE UPDATE: 18.8.0.0.191015 (30133603)
30112122;Database Release Update : 18.8.0.0.191015 (30112122)
28090553;OCW RELEASE UPDATE 18.3.0.0.0 (28090553)
--19
GI 19.5
30125133;Database Release Update : 19.5.0.0.191015 (30125133)
30122167;ACFS RELEASE UPDATE 19.5.0.0.0 (30122167)
30122149;OCW RELEASE UPDATE 19.5.0.0.0 (30122149)
29401763;TOMCAT RELEASE UPDATE 19.0.0.0.0 (29401763)
DB 19.5
30128191;OJVM RELEASE UPDATE: 19.5.0.0.191015 (30128191)
30125133;Database Release Update : 19.5.0.0.191015 (30125133)
29585399;OCW RELEASE UPDATE 19.3.0.0.0 (29585399)

I must say that the naming in general looks more consistent, that is a good thing!

For the database (DB) we see THREE patches (I expected two):
– The database release update patch, which nicely reports its version.
– The OJVM patch, which also reports its version.
– This is weird. One of the grid infrastructure patches, the “OCW” patch, is applied to the database home. In fact, it is applied to the base release. Because it’s not a patch that is documented as needing to be applied, it will sit at this version and never be updated. However, this unexpected patch is consistently applied to the base release for both versions 18 and 19.

For the grid home (GI) we see an inconsistent number of patches (!) between 18.8 and 19.5. This is partly as expected, but I found a weird patch applied too. Let’s go over the patches:
– The ACFS patch is totally expected and appropriately named.
– The OCW patch is expected and appropriately named too.
– The Database Release Update patch is expected and appropriately named.
– The TOMCAT patch is expected. I don’t understand why it can’t have the RU numbering, but at least it’s consistent between 18 and 19.
– The DBWLM patch is only applied to the 18 home. As far as I understand, this is how it’s supposed to be: DBWLM is not regularly updated like the other ones above, so it’s okay to have an older version of it alongside the other patches, and if there isn’t a patch to apply, it can simply not be there, as with version 19 in this case.
– The OJVM patch puzzles me. I don’t know what to think of it. Also, it’s only applied to the base release of version 18, not to the base release of version 19. This, very much like the OCW patch, will never be updated. But I just don’t understand it: this patches the java virtual machine in the database, which for GI is the ASM instance, for which, as far as I know, the java virtual machine isn’t used.

Conclusion.
The naming of the patches as visible with “opatch lspatches” has certainly improved with version 18 and higher. Still, it would be helpful if the grid infrastructure tomcat patch followed the same naming as the other patches.

I have doubts about the inclusion of two patches in the base releases of versions 18 and 19:
The OJVM patch inclusion in the base release of grid infrastructure of version 18 only.
The OCW patch inclusion in the base release of the database of version 18 and 19.

Addendum.
After debating this on twitter and with my colleagues, I found that my OCW assumptions were incorrect. The grid infrastructure patch readme for versions 12.2 and up reasonably clearly describes that the OCW sub-patch that is part of the GI PSU/RU patch should be applied to the database home too (if clusterware is used for that home). So that means that if the database home is a different version than the grid infrastructure home and it is used with grid infrastructure, you should download the grid infrastructure PSU/RU patch and apply the OCW sub-patch to the oracle database home, because the OCW patch is not in the database PSU/RU patch.
Because the database home patch itself is also in the grid infrastructure PSU/RU patch, I see no reason to bother downloading the database patch, and now only download and use the grid infrastructure PSU/RU patch, because that contains the database home patch as well as the OCW patch.

Addendum 2.
The OCW patch being installed into the grid infrastructure home and additionally in the database home turns out to be a change with PSU 11.2.0.4.5. Before that, the OCW patch did exist with the grid infrastructure home patch, but couldn’t be applied to the database home. This change was documented in the patch readme, but wasn’t really heavily marketed by Oracle.

Addendum 3.
Despite OCW being documented as being additionally applied to the database home for 11.2.0.4(.5) and higher, in my tests it was not possible to apply the 11.2.0.4 OCW patch to a database home for any 11.2.0.4 PSU; opatch fails with a dependency problem. It does succeed starting from 12.1.0.2 (12.1.0.1 not tested) and higher.

I have to say that when looking in the 11.2.0.4 OCW patch metadata, it says ‘rac’, so the OCW patch might succeed when the home is explicitly installed for a cluster database. I tested with grid infrastructure installed for a single machine (“siha”).

Docker Birmingham – November 2019


Yesterday evening I went to the Docker Birmingham meetup, sponsored by Black Cat Technology Solutions.

This event was a single longer session by Matt Todd called “Make Data Science Great Again (Part 2)”. I didn’t see part 1, but he gave a summary, so that wasn’t too much of a problem.

Matt started off by introducing the scientific method and discussed reducing variables when testing, so you know the impact of a change. The suggestion being that development and data science should be the same. What better way to reduce variables than to package up a data science lab to make sure everyone is working on the same thing, so there is no/less variability between researchers, and they can focus on their work, not piecing together the kit.

He then went on to discuss Cloud Native Application Bundles (CNAB), and how they can be used to more reliably package multi-container applications, guaranteeing dependencies to a level greater than that possible by using Docker Compose alone. As an example he demoed his Digital Scientists Lab, which is a CNAB bundle containing a bunch of common kit used by data scientists (Jupyter, Spark, Flink, Kafka, RabbitMQ, Nifi, Elastic Stack etc.), which he could in theory give to several people to run experiments. It’s just his playground, but it gives you an idea of what’s possible. Using Nifi to link stuff together looked cool!

I started to make a few mental connections when he discussed the relationship to experimental data, because I look after the infrastructure for a research publishing system, and being able to keep not only the published research, but also the experimental data and potentially a way to reproduce the research findings is pretty important. It keeps those academics honest, if you know what I mean. It’s funny how just a few links to something you know a little bit about, and other stuff starts falling into place.

It was all a bit mind-blowing, but in a good way. I’m still only scratching the surface of this stuff, but it’s really good to see what else is going on in this space. I’ve added a few more things to my list of things to play around with.

It’s a couple of months until the next event, but there’s a CNCF event next month, so watch out for that!

Thanks to the folks at Black Cat Technology Solutions for sponsoring and organising the event, and to Matt Todd for doing a great session. See you soon!

Cheers

Tim…



Table Space

There’s a request on the Oracle Developer Forum for assistance to write a report that shows the total space allocation for a table – including its indexes and LOB columns. There is no requirement to allow for partitioned tables, but it’s not entirely clear whether the code should allow for index-organized tables and clustered tables, and there’s no comment about nested tables.

The OP has made an attempt to write a suitable query, but the query joins dba_tables to dba_indexes then to dba_lobs then three times to dba_segments (all the joins being outer joins) before aggregating on table name. Unsurprisingly this produces the wrong results because (for example) if a table has two indexes the join from dba_tables to dba_indexes will double the bytes reported for the table. As Andrew Sayer points out in the thread, it would be better to aggregate on each of the separate component types before joining.
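
For example, the index component could be summed per table on its own before any join back to the table level – a sketch only (with a placeholder schema name), not the query this note goes on to build:

select
        ind.table_owner,
        ind.table_name,
        sum(nvl(seg.bytes,0))           index_bytes
from
        dba_indexes       ind
left outer join
        dba_segments      seg
on
        seg.owner        = ind.owner
and     seg.segment_name = ind.index_name
where
        ind.table_owner  = 'MY_SCHEMA'          -- schema of interest
group by
        ind.table_owner,
        ind.table_name
;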

This introduces an important, generic, principle to writing code.

  • If it’s difficult to write a single statement, can you break the task down into separate components that are easier to handle?
  • If you can express the problem as a small set of simpler components, can you then combine the components?
  • If this approach works, is any loss of efficiency balanced by a gain in clarity and maintainability?

In the case of the OP, this isn’t a query that’s going to be running every few seconds – maybe it will run once per day, maybe three or four times per day – so there’s no need to worry about making it as efficient as possible; let’s go for simplicity.

Here’s a query that reports the space allocated for a simple heap table. It references dba_tables and dba_segments, so it has to be run by a user with DBA privileges, and at the moment it’s restricted to reporting a single user.

Wait – before I write the query I ought to create a testbed to see if the query is working. So let’s take an empty schema and create a few objects. Let’s create

  • a simple heap table with two indexes and two LOB columns.
  • an Index Organized Table (IOT) with an overflow and two LOB columns.
  • a table with two columns that are a nested table type

Here’s some suitable code:


rem
rem     Script:         sum_table_space.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

prompt  ============================================================
prompt  Simple heap table - copy of all_objects - then add two CLOBs
prompt  ============================================================

create table test_user.t1 as select * from all_objects;

alter table test_user.t1 add (c1 clob) lob (c1) store as basicfile t1_lob1 (disable storage in row);
alter table test_user.t1 add (c2 clob) lob (c2) store as basicfile t1_lob2 (disable storage in row);

update test_user.t1 set c1 = 'x', c2 = 'x' where rownum <= 125;

create index test_user.t1_i1 on test_user.t1(object_name, object_type, object_id);
create index test_user.t1_i2 on test_user.t1(object_type, object_name, object_id);

prompt  =========================================================
prompt  Simple IOT table - with a couple of CLOBs in the overflow
prompt  =========================================================

create table test_user.iot1 (
        id              number          not null,
        ind_pad         varchar2(20)    not null,
        n2              number          not null,
        n3              number          not null,
        padding         varchar2(50),
        ct1             clob,
        ct2             clob,
        constraint iot1_pk primary key (id, ind_pad)
)
organization index
        overflow
        including n3
lob(ct1) store as basicfile iot_lob1(
        disable storage in row
)
lob(ct2) store as basicfile iot_lob2(
        disable storage in row
)
;

insert into test_user.iot1
select
        rownum,
        rpad(rownum,20),
        rownum,
        rownum,
        rpad('x',50,'x'),
        case when mod(rownum,100) = 0 then rpad('x',100,'x') end,
        case when mod(rownum,100) = 0 then rpad('x',100,'x') end
from
        all_objects
where
        rownum <= 10000
;

commit;

prompt  ====================================================
prompt  Now create types to allow creation of a nested table
prompt  ====================================================

create type test_user.jpl_item as object (n1 number, v1 varchar2(32));
/

create type test_user.jpl_tab_type as table of jpl_item;
/

create table test_user.demo_nest (
        id      number,
        nest1   test_user.jpl_tab_type,
        nest2   test_user.jpl_tab_type
)
nested table nest1 store as t_nested1
return as locator
nested table nest2 store as t_nested2
return as locator
;

create unique index test_user.tn1_pk on test_user.t_nested1(nested_table_id, n1) compress 1;
create unique index test_user.tn2_pk on test_user.t_nested2(nested_table_id, n1) compress 1;

create index  test_user.tn1_v1 on test_user.t_nested1(v1);
create index  test_user.tn2_v1 on test_user.t_nested2(v1);

insert into test_user.demo_nest values (
        1,
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item One one'),
                test_user.jpl_item(2,'Item One two')
        ),
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item One one'),
                test_user.jpl_item(2,'Item One two')
        )
);

insert into test_user.demo_nest values (
        2,
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item Two one'),
                test_user.jpl_item(2,'Item Two two')
        ),
        test_user.jpl_tab_type(
                test_user.jpl_item(1,'Item Two one'),
                test_user.jpl_item(2,'Item Two two')
        )
);

commit;

You’ll notice that I’ve prefixed every table, index and type name with a schema name. This is because I set up this test to run as a DBA so I’m creating the objects while connected with a different id.

Of course, before trying to report allocations summed by base table, it would be sensible to produce a simple list of the segments we should see so that we know when we’ve reported all of them. So let’s start with that very simple report:


column bytes format 999,999,999,999
column segment_name format a25

break on report on owner on object_name skip 1
compute sum of bytes on report

select
        segment_name, segment_type, bytes
from
        dba_segments
where
        owner = 'TEST_USER'
order by
        segment_type, segment_name
;

SEGMENT_NAME              SEGMENT_TYPE                  BYTES
------------------------- ------------------ ----------------
IOT1_PK                   INDEX                     1,048,576
SYS_C008380               INDEX                     1,048,576
SYS_C008381               INDEX                     1,048,576
SYS_FK0000075579N00002$   INDEX                     1,048,576
SYS_FK0000075579N00004$   INDEX                     1,048,576
T1_I1                     INDEX                     5,242,880
T1_I2                     INDEX                     5,242,880
TN1_PK                    INDEX                     1,048,576
TN1_V1                    INDEX                     1,048,576
TN2_PK                    INDEX                     1,048,576
TN2_V1                    INDEX                     1,048,576
SYS_IL0000075565C00027$$  LOBINDEX                  1,048,576
SYS_IL0000075565C00028$$  LOBINDEX                  1,048,576
SYS_IL0000075572C00006$$  LOBINDEX                  1,048,576
SYS_IL0000075572C00007$$  LOBINDEX                  1,048,576
IOT_LOB1                  LOBSEGMENT                1,048,576
IOT_LOB2                  LOBSEGMENT                1,048,576
T1_LOB1                   LOBSEGMENT                2,097,152
T1_LOB2                   LOBSEGMENT                2,097,152
T_NESTED1                 NESTED TABLE              1,048,576
T_NESTED2                 NESTED TABLE              1,048,576
DEMO_NEST                 TABLE                     1,048,576
SYS_IOT_OVER_75572        TABLE                     1,048,576
T1                        TABLE                    12,582,912
                                             ----------------
sum                                                47,185,920

So we have a list of segments, and we have a sum of bytes to aim for. One thing you might notice, though, is that the name “IOT1” has “disappeared”; instead, the thing we see as a “TABLE” is its overflow segment, called SYS_IOT_OVER_75572 (the number being the object_id of the table we originally defined). We will have to do something in our code to translate that sys_iot_over_75572 to iot1 if we want to make our final report easy to comprehend.

There are a few other oddities in the list, but some of them we may be familiar with already – the indexes with names like SYS_IL0000075565C00027$$ are the lob indexes for the four lobs we created. (I deliberately created basicfile LOBs in case the OP wasn’t using the securefile LOBs that most systems are now using.) The 75565 in the sample here is the object_id of the base table; the C00027 tells us that the LOB is column 27 of the table definition.
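
As an aside, a quick way to tie those SYS_IL… lob index (and lob segment) names back to their base table and column, without decoding the names by hand, is a side query against dba_lobs – not part of the final report, just a sanity check:

select
        table_name, column_name, segment_name, index_name
from
        dba_lobs
where
        owner = 'TEST_USER'
order by
        table_name, column_name
;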

You may be wondering about the indexes like SYS_C008380 and SYS_FK0000075579N00002$. We may need to work on those – but I’ll tell you the answer straight away – the former is a unique index on the “nest1” column in the demo_nest table, and the latter is a “foreign key” index on the (invisible) nested_table_id column in the nested tables.

Side note: although the sys_fk0000075579N00002$ index looks as if it’s supposed to be a “foreign key” index, as suggested by its name, the dependency, and the function, there is no foreign key constraint declared between the parent table and the nested table. Also, if you check dba_indexes, the generated column has the value ‘N’, even though the name of the index was clearly generated by Oracle internal code.  (The index at the other end of the join – sys_c008380 – is, however, flagged with generated = ‘Y’.)
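
If you want to see those flags for yourself, here’s a simple throwaway query (again, not part of the final script) reporting the uniqueness and generated flags, and the indexed columns, for the two system-named indexes from my sample – the column behind the sys_fk index should be the hidden nested_table_id column, though it may appear under a system-generated name:


select
        ind.table_name,
        ind.index_name,
        ind.uniqueness,
        ind.generated,
        col.column_name
from
        dba_indexes     ind
join
        dba_ind_columns col
on
        col.index_owner = ind.owner
and     col.index_name  = ind.index_name
where
        ind.owner       = 'TEST_USER'
and     ind.index_name in ('SYS_C008380', 'SYS_FK0000075579N00002$')
order by
        ind.index_name, col.column_position
/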

Let’s start building our solution with the easiest bit.


select 
        tab.owner, 
        cast('TABLE' as varchar2(11))   object_type,
        tab.table_name,
        nvl(seg_tab.bytes,0)            bytes  
from
        dba_tables        tab  
left outer join
        dba_segments      seg_tab  
on
        seg_tab.owner      = tab.owner
and     seg_tab.segment_name = tab.table_name
where
        tab.owner          = 'TEST_USER'
/


OWNER           OBJECT_TYPE TABLE_NAME                      BYTES
--------------- ----------- -------------------- ----------------
TEST_USER       TABLE       T1                         12,582,912
                TABLE       SYS_IOT_OVER_75572          1,048,576
                TABLE       DEMO_NEST                   1,048,576
                TABLE       IOT1                                0
***************                                  ----------------
sum                                                    14,680,064

Immediately we have to correct two problems – we can see the original iot1 table, but it doesn’t have an associated data segment; it has a “top”, which is the primary key index segment, and an “overflow”, which is designated by Oracle as a table segment. So we need to check the columns in dba_tables to work out how to get the more appropriate name reported for the table segment, and add a predicate to hide the original name. Here’s “tables only mark 2”:

select 
        tab.owner, 
        cast('TABLE' as varchar2(11))   object_type,
        case
                 when tab.iot_type = 'IOT_OVERFLOW' 
                        then tab.iot_name
                else tab.table_name
        end                             object_name,
        case
                 when tab.iot_type = 'IOT_OVERFLOW' 
                        then tab.table_name
        end                             auxiliary_name,
        nvl(seg_tab.bytes,0)            bytes  
from
        dba_tables        tab  
left outer join
        dba_segments      seg_tab  
on
        seg_tab.owner        = tab.owner
and     seg_tab.segment_name = tab.table_name
where
        tab.owner            = 'TEST_USER'
and     nvl(tab.iot_type,'NORMAL') != 'IOT'


OWNER           OBJECT_TYPE OBJECT_NAME          AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       TABLE       T1                                                   12,582,912
TEST_USER       TABLE       IOT1                 SYS_IOT_OVER_75572               1,048,576
TEST_USER       TABLE       DEMO_NEST                                             1,048,576

I’ve identified the table type using the iot_type column in dba_tables, hidden rows where the iot_type is ‘IOT’, and reported the iot_name (with table_name as an auxiliary name) when the iot_type is ‘IOT_OVERFLOW’. And we can now check that the result is consistent with the content of dba_segments that lists segment_type = ‘TABLE’.

So let’s move on to indexes.

select 
        ind.table_owner                 owner, 
        cast('INDEX' as varchar2(11))   object_type,
        ind.table_name,
        index_name                      auxiliary_name,
        nvl(seg_ind.bytes,0)            bytes  
from
        dba_indexes       ind  
left outer join 
        dba_segments      seg_ind  
on
        ind.owner      = seg_ind.owner
and     ind.index_name = seg_ind.segment_name  
where
        ind.table_owner = 'TEST_USER'


OWNER           OBJECT_TYPE TABLE_NAME           AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       INDEX       T1                   SYS_IL0000075565C00027$$         1,048,576
TEST_USER       INDEX       T1                   SYS_IL0000075565C00028$$         1,048,576
TEST_USER       INDEX       T1                   T1_I1                            5,242,880
TEST_USER       INDEX       T1                   T1_I2                            5,242,880
TEST_USER       INDEX       IOT1                 IOT1_PK                          1,048,576
TEST_USER       INDEX       IOT1                 SYS_IL0000075572C00007$$         1,048,576
TEST_USER       INDEX       IOT1                 SYS_IL0000075572C00006$$         1,048,576
TEST_USER       INDEX       T_NESTED1            SYS_FK0000075579N00002$          1,048,576
TEST_USER       INDEX       T_NESTED2            SYS_FK0000075579N00004$          1,048,576
TEST_USER       INDEX       DEMO_NEST            SYS_C008380                      1,048,576
TEST_USER       INDEX       DEMO_NEST            SYS_C008381                      1,048,576
TEST_USER       INDEX       T_NESTED1            TN1_PK                           1,048,576
TEST_USER       INDEX       T_NESTED2            TN2_PK                           1,048,576
TEST_USER       INDEX       T_NESTED1            TN1_V1                           1,048,576
TEST_USER       INDEX       T_NESTED2            TN2_V1                           1,048,576

A quick check shows that we’ve picked up the 15 indexes reported by dba_segments – although I’ve labelled them all just as “INDEX” while dba_segments does label LOB indexes differently. As you can see I’ve reported the table name in each case with the index name as the “auxiliary” name. This will be of use when I’m summing up the space associated with each table.

On to the LOBs – also easy:

select 
        lob.owner, 
        cast('LOB' as varchar2(11))   object_type,
        lob.table_name,
        lob.column_name               auxiliary_name,
        nvl(seg_lob.bytes,0)          bytes  
from
        dba_lobs    lob
left outer join 
        dba_segments      seg_lob
on
        seg_lob.owner        = lob.owner
and     seg_lob.segment_name = lob.segment_name
where 
        lob.owner            = 'TEST_USER'


OWNER           OBJECT_TYPE TABLE_NAME           AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       LOB         T1                   C1                               2,097,152
TEST_USER       LOB         T1                   C2                               2,097,152
TEST_USER       LOB         IOT1                 CT1                              1,048,576
TEST_USER       LOB         IOT1                 CT2                              1,048,576

And finally the nested tables:


select 
        nst.owner, 
        cast('NESTED' as varchar2(11))   object_type,
        nst.parent_table_name            object_name,
        table_name                       auxiliary_name,
        nvl(seg_nst.bytes,0)             bytes  
from
        dba_nested_tables    nst  
left outer join 
       dba_segments          seg_nst  
on
        seg_nst.owner        = nst.owner
and     seg_nst.segment_name = nst.table_name  
where
        nst.owner            = 'TEST_USER'

OWNER           OBJECT_TYPE OBJECT_NAME          AUXILIARY_NAME                       BYTES
--------------- ----------- -------------------- ------------------------- ----------------
TEST_USER       NESTED      DEMO_NEST            T_NESTED1                        1,048,576
TEST_USER       NESTED      DEMO_NEST            T_NESTED2                        1,048,576

A quick check tells us that the four pieces of code produce an exact match for all the segments reported in dba_segments – so all we have to do now is stitch the four queries together with UNION ALL, then sum(bytes) by owner and table_name.

There are three little problems to deal with, though:

  • We have the predicate “where {owner} = ‘TEST_USER'” appearing 4 times in total, which is a bit messy. If we put this outside the UNION ALL, will Oracle be smart enough to push it back inside the UNION ALL as part of its query transformation? It seems to (there’s a simple sketch of the kind of check I mean just after this list); the plan (which is several hundred lines long – so I’m not going to look too closely) does change a little, but on my little test system it didn’t make much difference to the performance. Initially it looks as if it might be okay to create a view from this query with no restriction on table owner, and allow users to query the view with a schema name.
  • The code to report indexes reports the indexes on the nested tables under t_nested1 and t_nested2 – we want the indexes to be associated with table demo_nest, so we need to refine that bit of the code. It needs an outer join to dba_nested_tables to supply the parent_table_name if it exists.
  • On my little system the query takes several seconds to run. Should I worry about that? Not until I’ve tested it on a much bigger system, and not until I know how frequently it needs to run. It may be good enough as it stands, and the ease with which I modified the first version of my code to handle the nested tables indexing issue is an indicator of the benefits of keeping a complex job as the sum of its parts when it’s reasonable to do so.
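
To expand on the first point, here’s a much simplified sketch – just two branches against dba_segments, not the real query – where you can look in the plan to see whether the owner predicate has been applied inside each branch of the UNION ALL:


explain plan for
select
        owner, segment_name, bytes
from    (
        select owner, segment_name, bytes from dba_segments where segment_type = 'TABLE'
        union all
        select owner, segment_name, bytes from dba_segments where segment_type = 'INDEX'
        )
where
        owner = 'TEST_USER'
;

select * from table(dbms_xplan.display);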

Let’s go straight to the final (so far) SQL:


select
        owner, object_name, sum(bytes) 
from    (
        select 
                tab.owner, 
                cast('TABLE' as varchar2(11))   object_type,
                case
                         when tab.iot_type = 'IOT_OVERFLOW' 
                                then tab.iot_name
                        else tab.table_name
                end                             object_name,
                case
                         when tab.iot_type = 'IOT_OVERFLOW' 
                                then tab.table_name
                end                             auxiliary_name,
                nvl(seg_tab.bytes,0)            bytes  
        from
                dba_tables        tab  
        left outer join
                dba_segments      seg_tab  
        on
                seg_tab.owner        = tab.owner
        and     seg_tab.segment_name = tab.table_name
        where
                tab.owner            = 'TEST_USER'
        and     nvl(tab.iot_type,'NORMAL') != 'IOT'
        union all
        select 
                ind.table_owner                  owner, 
                cast('INDEX' as varchar2(11))    object_type,
                nvl(
                        nst.parent_table_name,
                        ind.table_name
                )                                table_name,
                index_name                       auxiliary_name,
                nvl(seg_ind.bytes,0)             bytes  
        from
                dba_indexes       ind  
        left outer join
                dba_nested_tables nst
        on
                nst.owner       = ind.table_owner
        and     nst.table_name  = ind.table_name
        left outer join 
                dba_segments      seg_ind  
        on
                seg_ind.owner         = ind.owner
        and     seg_ind.segment_name  = ind.index_name 
        where
                ind.table_owner = 'TEST_USER'
        union all
        select 
                lob.owner, 
                cast('LOB' as varchar2(11))   object_type,
                lob.table_name,
                lob.column_name               auxiliary_name,
                nvl(seg_lob.bytes,0)          bytes  
        from
                dba_lobs    lob
        left outer join 
                dba_segments      seg_lob
        on
                seg_lob.owner        = lob.owner
        and     seg_lob.segment_name = lob.segment_name
        where 
                lob.owner            = 'TEST_USER'
        union all
        select 
                nst.owner, 
                cast('NESTED' as varchar2(11))   object_type,
                nst.parent_table_name            object_name,
                table_name                       auxiliary_name,
                nvl(seg_nst.bytes,0)             bytes  
        from
                dba_nested_tables    nst  
        left outer join 
               dba_segments          seg_nst  
        on
                seg_nst.owner        = nst.owner
        and     seg_nst.segment_name = nst.table_name
        where
                nst.owner         = 'TEST_USER'
        )
where
        owner = 'TEST_USER'
group by
        owner, object_name
order by
        sum(bytes)
/

OWNER           OBJECT_NAME                SUM(BYTES)
--------------- -------------------- ----------------
TEST_USER       IOT1                        6,291,456
TEST_USER       DEMO_NEST                  11,534,336
TEST_USER       T1                         29,360,128
                                     ----------------
sum                                        47,185,920


I wrote this note because my response to the original query was an informal suggestion to take this approach; then I thought it might be worth writing a general comment about solving big problems by starting with little problems and stitching the pieces together, using this case as a concrete example – then I just went ahead and finished the thing off within the boundaries of the original requirements.

There’s a reason, though, why I tend to avoid publishing SQL that might be useful – it’s always possible to overlook bits and end up with people trying to use code that’s not totally appropriate to their system. In this case, of course, there’s a total (but deliberate) failure to handle clusters, partitions and subpartitions, but I don’t know how many of the other “less commonly used” bits of Oracle I haven’t considered. I do know that I haven’t touched on domain indexes (such as text indexes with all their convoluted collections of tables and indexes), but there may be some oddities of (e.g.) advanced queueing, replication, and audit vault that fall outside the range of the possibilities I’ve covered above.

Update 7th Nov 2019

The OP from OTN reported seeing some nested tables in the output – which I thought I’d catered for – so I modified the code I have on my sandbox to add a nested table to the IOT, added a couple of extra indexes on the parent of the nested table, changed a couple of column and object names to quoted mixed case, and everything still seemed to work okay.

It turned out that the OP had nested tables that were nested 5 deep – which means I need a hierarchical query to connect the 2nd to 5th (or nth) layer of nesting back to the parent to report the total volume against the top level table. In order to deal with this problem I modified the query in the smallest possible way.

I had two references to the view dba_nested_tables, so I created a hierarchical query based on dba_nested_tables that returned the “oldest ancestor” with each child table rather than the immediate parent. (I’ve written a separate note showing the derivation and testing). Then I turned this query into a “with subquery” (CTE) called top_nested_tables and changed the two calls that the main query made to dba_nested_tables  so that they referenced the CTE instead:


with my_nested_tables as (
select
        /*+ materialize */
        owner, parent_table_name, table_name
from
        dba_nested_tables
where   owner = 'TEST_USER'
),
top_nested_tables as (                                                             -- ***** Replacement CTE defined
select  /*+ materialize */
        owner, parent_table parent_table_name, table_name
from    (
        select
                owner,
                level,
                connect_by_root parent_table_name parent_table,
                table_name 
        from
                my_nested_tables
        connect by
                parent_table_name = prior table_name
        )
where   (owner, parent_table) not in (
                select owner, table_name
                from   my_nested_tables
        )
)
select                                                                             -- ***** Main select
        owner, object_name, sum(bytes)  total_bytes
from    (
        select 
                tab.owner, 
                cast('TABLE' as varchar2(11))   object_type,
                case
                         when tab.iot_type = 'IOT_OVERFLOW' 
                                then tab.iot_name
                        else tab.table_name
                end                             object_name,
                case
                         when tab.iot_type = 'IOT_OVERFLOW' 
                                then tab.table_name
                end                             auxiliary_name,
                nvl(seg_tab.bytes,0)            bytes  
        from
                dba_tables        tab  
        left outer join
                dba_segments      seg_tab  
        on
                seg_tab.owner        = tab.owner
        and     seg_tab.segment_name = tab.table_name
        where
                tab.owner            = 'TEST_USER'
        and     nvl(tab.iot_type,'NORMAL') != 'IOT'
        union all
        select 
                ind.table_owner                  owner, 
                cast('INDEX' as varchar2(11))    object_type,
                nvl(
                        nst.parent_table_name,
                        ind.table_name
                )                                table_name,
                index_name                       auxiliary_name,
                nvl(seg_ind.bytes,0)             bytes  
        from
                dba_indexes       ind  
        left outer join
                top_nested_tables nst                                              -- ***** view replacement 
        on
                nst.owner       = ind.table_owner
        and     nst.table_name  = ind.table_name
        left outer join 
                dba_segments      seg_ind  
        on
                seg_ind.owner         = ind.owner
        and     seg_ind.segment_name  = ind.index_name 
        where
                ind.table_owner = 'TEST_USER'
        union all
        select 
                lob.owner, 
                cast('LOB' as varchar2(11))   object_type,
                lob.table_name,
                lob.column_name               auxiliary_name,
                nvl(seg_lob.bytes,0)          bytes  
        from
                dba_lobs    lob
        left outer join 
                dba_segments      seg_lob
        on
                seg_lob.owner        = lob.owner
        and     seg_lob.segment_name = lob.segment_name
        where 
                lob.owner            = 'TEST_USER'
        union all
        select 
                nst.owner, 
                cast('NESTED' as varchar2(11))   object_type,
                nst.parent_table_name            object_name,
                table_name                       auxiliary_name,
                nvl(seg_nst.bytes,0)             bytes  
        from
                top_nested_tables nst                                              -- ***** view replacement 
        left outer join 
                dba_segments          seg_nst  
        on
                seg_nst.owner        = nst.owner
        and     seg_nst.segment_name = nst.table_name
        where
                nst.owner         = 'TEST_USER'
        )
where
        owner = 'TEST_USER'
group by
        owner, object_name
order by
        sum(bytes)
;

If you followed the link to the original Oracle Developer Forum thread you’ll see that the OP has made much more use of the “with subquery” (CTE) construct – including a rather nice touch of having a CTE that reads “select cast(‘MY_SCHEMA’ as varchar2(30)) from dual”. This is particularly cute because it means you only have to supply the fixed schema name once in the query rather than repeating it several times, because you can then inject the schema name everywhere else it’s needed by joining to the CTE.
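
If you want to see the shape of that trick in isolation, here’s a trivial example (the CTE name target_schema is my choice, not the OP’s) – the schema name is typed once in the CTE and then joined in wherever it’s needed:


with target_schema as (
        select cast('TEST_USER' as varchar2(30)) schema_name from dual
)
select
        seg.segment_name, seg.segment_type, seg.bytes
from
        dba_segments  seg
join
        target_schema ts
on
        seg.owner = ts.schema_name
/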

Bacon, Bloggin’ and First Day Keynote at PASS Summit 2019!

After a flurry of release announcements at MS Ignite this week, it was time to talk about the technology at a deeper level at PASS Summit.  I love this event, which has my undying loyalty due to the sheer amount of technical focus on everything in the Microsoft data platform, and of course I was not disappointed with the first day keynote on Wednesday!

We started out with a great treat of information before the keynote, just so we were prepared for what was to come in the next couple days…plus bacon!

Grant Says

Grant Fritchey started the keynote by welcoming all the attendees and talking about how impressive the event has been even though it’s just started!  I have to admit, it’s only the first day and I could already use a nap.  It’s pretty impressive how much networking and technical socializing happens at the conference.

Keynote Speaker Rohan Kumar

Rohan Kumar is the Corporate VP over Azure Data and is one of those people who is a symbol of what I see inside Microsoft every day – engaged, enthusiastic and fully understanding of the unified vision of the technology.  Unlike previous years, the Microsoft folks weren’t held back waiting on releases at MS Ignite, so instead they were able to dig in deep to what the announcements mean to us as the professionals using the technology.  I. LOVE. THAT.

We’ve been waiting impatiently for SQL Server 2019, and now we’re able to talk not about what is in private preview or what might become part of the GA release, but about what IS in the GA release!  Rohan discussed the importance of the changes, and how SQL Server 2019 will be the management layer over your data lakes, with the ability to use T-SQL, Python and everything in between.

Know that soon you’ll be able to run SQL databases where you may never have imagined, such as ARM (Raspberry Pi!). Rohan then talked about the future of data with Azure SQL Database Edge, which will be the ARM deployment for Azure.  I was sitting in my chair squealing as Rohan showed us an RPi with Edge on it.  I. WANT.

Cloning a schema with one line

In the world of DevOps, continuous integration and repeatable test cases, the demand for being able to

  • quickly build a suite of database objects,
  • utilise it for a series of tests,
  • then throw the objects away

has become far more common. This is one of the many great use cases for pluggable databases with all of the powerful cloning facilities available. In particular, now that you can take advantage of pluggable databases without* incurring additional license fees, there are some great opportunities there…but that is the topic for another post.

What if the “unit of work” is not an entire pluggable database? What if we just want to clone a schema within a pluggable database? Gone are the days where a DBA might be asked to clone a schema once per month, or even once per week. Cloning a schema is now something developers want to do multiple times per day, as part of an automated process, without having to involve anyone at all!  Welcome to DevOps! Smile

Unfortunately, we do not yet have a command in the Oracle database which lets you run:

create user SCOTT2 from SCOTT;

so I thought I’d throw something together which hopefully is the next best thing. We can use the PL/SQL API to the Data Pump feature to facilitate this. Taking advantage of a few concepts, namely:

  • Network based import,
  • The implicit database link associated with a global name,
  • Dynamically alterable external table definitions

we can build a PL/SQL procedure that is our one line clone schema resource for developers.
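
Before getting to the procedure, a quick note on the second of those points: the procedure simply reads the current global name and passes it to dbms_datapump.open as the remote_link, so the import pulls from the same database over the implicit loop-back link and there is no “create database link” step to perform or clean up. You can see the name it will use with:


select global_name from global_name;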


--
-- if you want to allow the drop user option, then the 
-- owning schema will need the following privilege
--
-- Needless to say, you might want to wrap this within a procedure
-- in its own right to ensure people don't drop the WRONG user
--
-- For example:
--
-- create or replace
-- procedure customised_drop_user(p_user varchar2) is
-- begin
--   if .... then
--      execute immediate 'drop user '||p_user||' cascade';
--   else
--      raise_application_error(-20000,'What the hell?!?!?');
--   end if;
-- end;
--

grant drop user to MY_USER;

drop table datapump_clone_log;

--
-- the initial file in the definition (dummy.log) must
-- exist, and the directory you are using (TEMP) must match
-- the declaration in the PLSQL proc which follows
--
create table datapump_clone_log (
     msg varchar2(4000)
)
organization external
( type oracle_loader
  default directory TEMP
  access parameters
  ( records delimited by newline
    fields terminated by ','
    missing field values are null
   ( msg )
   )
   location ('dummy.log')
) reject limit unlimited;
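
For completeness, here’s the kind of setup being assumed for that definition – the directory path below is just an example, and dummy.log only needs to exist (it can be empty):


-- run as a suitably privileged user; pick a path that suits your system
create or replace directory TEMP as '/u01/app/oracle/clone_logs';

grant read, write on directory TEMP to MY_USER;

-- then create an empty dummy.log in that directory from the OS, e.g.
--   $ touch /u01/app/oracle/clone_logs/dummy.log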

--
-- p_old    = existing schema
-- p_new    = target schema
-- p_drop_new = whether we drop the target schema first
-- p_asynch = whether we wait or simply launch the import and return
--
-- I'd recommend p_asynch as false, because in that way, you'll get the
-- import log returned right back to your screen
--
create or replace
procedure clone_schema(
              p_old varchar2, 
              p_new varchar2, 
              p_drop_new boolean default true,
              p_asynch boolean default false
              ) is
  l_handle       number;
  l_status       ku$_status; 
  l_state        varchar2(30);
  l_link         varchar2(128);
  l_job_name     varchar2(128) := upper(p_old)||'_SCHEMA_IMP';
  l_log_file     varchar2(128) := lower(p_old)||'_import.log';
  l_default_dir  varchar2(128) := 'TEMP';
  rc             sys_refcursor;
  l_msg          varchar2(4000);
  
  procedure info(m varchar2,p_dbms_out boolean default false) is
  begin
    dbms_application_info.set_client_info(to_char(sysdate,'hh24miss')||':'||m);
    if p_dbms_out then
      dbms_output.put_line(to_char(sysdate,'hh24miss')||':'||m);
    end if;
  end;
BEGIN
  if p_drop_new then
    begin
      info('Dropping '||p_new,p_dbms_out=>true);
      --
      -- See notes about potentially wrapping this for safety
      --
      execute immediate 'drop user '||p_new||' cascade';
    exception
      when others then
        if sqlcode != -1918 then raise; end if;
    end;
  end if;
  select global_name into l_link from global_name;
  
  l_handle := dbms_datapump.open(
    operation   => 'IMPORT',
    job_mode    => 'SCHEMA',
    remote_link => l_link,
    job_name    => l_job_name);

  dbms_datapump.add_file(
    handle    => l_handle,
    filename  => l_log_file,
    directory => l_default_dir,
    filetype  => dbms_datapump.ku$_file_type_log_file,
    reusefile => 1);

  dbms_datapump.metadata_filter(
    handle => l_handle,
    name   => 'SCHEMA_EXPR',
    value  => '= '''||p_old||'''');

  dbms_datapump.metadata_remap(
    handle    => l_handle,
    name      => 'REMAP_SCHEMA',
    old_value => p_old,
    value     => p_new);

  info('Starting job',p_dbms_out=>true);
  dbms_datapump.start_job(l_handle);

  if not p_asynch then
    loop
      begin
        dbms_lock.sleep(3);
        dbms_datapump.get_status(
          handle    => l_handle,
          mask      => dbms_datapump.ku$_status_job_status,
          job_state => l_state,
          status    => l_status);
          info('l_state='||l_state);
      exception
        when others then
          if sqlcode = -31626 then
             l_state := 'COMPLETED';
          else
             raise;
          end if;
      end;
      exit when (l_state = 'COMPLETED') or (l_state = 'STOPPED');
    end loop;
    info('Final state:'||l_state,p_dbms_out=>true);
  end if;
   
  dbms_datapump.detach(l_handle);

  if not p_asynch then
    open rc for 'select msg from datapump_clone_log external modify ( location ( '''||l_log_file||''' ) )';
    loop
      fetch rc into l_msg;
      exit when rc%notfound;
      dbms_output.put_line(l_msg);
    end loop;
    close rc;
  end if;
    
end;
/
sho err

You can also get the source from my repo here.

Now let’s have a look at the routine in action. This is from my 18c database.


SQL> set serverout on
SQL> exec clone_schema('SCOTT','SCOTT2');
172055:Dropping SCOTT2
172057:Starting job
172232:Final state:COMPLETED
Starting "MCDONAC"."SCOTT_SCHEMA_IMP":
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 184.1 MB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/TABLESPACE_QUOTA
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/TABLE/TABLE
. . imported "SCOTT2"."BIGT"                            1146660 rows
. . imported "SCOTT2"."DEPT"                                  4 rows
. . imported "SCOTT2"."EMP"                                  14 rows
. . imported "SCOTT2"."SALGRADE"                              5 rows
. . imported "SCOTT2"."BONUS"                                 0 rows
Processing object type SCHEMA_EXPORT/PROCEDURE/PROCEDURE
Processing object type SCHEMA_EXPORT/PROCEDURE/ALTER_PROCEDURE
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type SCHEMA_EXPORT/STATISTICS/MARKER
ORA-39082: Object type PROCEDURE:"SCOTT2"."BLAH2" created with compilation warnings
ORA-39082: Object type PROCEDURE:"SCOTT2"."BLAH" created with compilation warnings
Job "MCDONAC"."SCOTT_SCHEMA_IMP" completed with 2 error(s) at Wed Nov 6 17:21:29 2019 elapsed 0 00:00:33

PL/SQL procedure successfully completed.

Note: If you want to run this on a version of the database below 18c, you can simply break the dynamic external table alteration into an ALTER statement to change the location, and then just query the external table as per normal. All of the rest of the code should work without alteration.
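
For example, a minimal sketch of that pre-18c variant, using the same log file naming convention as the procedure (so cloning SCOTT writes scott_import.log):


-- pre-18c: repoint the external table at the new log file first...
alter table datapump_clone_log location ('scott_import.log');

-- ...then read it as a normal external table
select msg from datapump_clone_log;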

Enjoy!

* To see the details about the new license free options for pluggable databases, check out my video below: