Oakies Blog Aggregator

Bitmap Index On Column With 212552698 Distinct Values, What Gives? (I’d Rather Be High)

In my previous post on Indexing The Autonomous Warehouse, I highlighted how it might be necessary to create indexes to improve the performance and scalability of highly selective queries, as it might be on any Data Warehouse running on an Exadata platform. In the post, I created a Bitmap Index and showed how it improved SQL performance […]

You should set OCSID.CLIENTID

each time you grab an Oracle JDBC connection from the pool

For troubleshooting and monitoring performance, you want to follow what happens from the end-user down to the database. It is then mandatory to be able to identify the end-user and the application from the database session. With Oracle there are some 'dbms_application_info' strings to be set, like MODULE, ACTION and CLIENT_INFO. Those describe the tasks in the application code (like identifying the Java class or method from which the SQL statement is prepared), but they don't identify the end-user.

And you should forget about CLIENT_INFO, which is not very useful and rather misleading. OCSID.MODULE and OCSID.ACTION are set from JDBC with Connection.setClientInfo (one reason I find the CLIENT_INFO name misleading is that it cannot be set with setClientInfo). Of course, you can also call 'dbms_application_info.set_module', but that's an additional call to the database (which means network latency, an OS context switch,…). Using the JDBC setClientInfo with the OCSID namespace simply sends this information along with the next call.
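For comparison, the server-side equivalent looks like the following minimal sketch; each such call issued separately from the application is an extra round trip, which is exactly what the OCSID attributes avoid:

begin
  -- server-side equivalents of OCSID.MODULE/ACTION and OCSID.CLIENTID
  dbms_application_info.set_module(module_name => 'my Module', action_name => 'my Action');
  dbms_session.set_identifier('my Client ID');
end;
/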

Now, about identifying the end-user: there's the session CLIENT_ID (aka CLIENT_IDENTIFIER), which you can also set with Connection.setClientInfo (OCSID.CLIENTID). This one is visible in many Oracle views and follows database links. Here is an example where I create a demo user and a database link:

connect sys/oracle@//localhost/PDB1 as sysdba
drop public database link PDB1@SYSTEM;
grant dba to demo identified by demo;
create public database link PDB1@SYSTEM connect to SYSTEM
identified by oracle using '//localhost/PDB1';

The following JavaScript (run from SQLcl) connects with a JDBC Thin driver, sets OCSID.MODULE, OCSID.ACTION and OCSID.CLIENTID, and displays CLIENT_IDENTIFIER, MODULE and ACTION from V$SESSION:

script
// open a JDBC Thin connection as the DEMO user
var DriverManager = Java.type("java.sql.DriverManager");
var con = DriverManager.getConnection(
  "jdbc:oracle:thin:@//localhost/PDB1","demo","demo"
);
con.setAutoCommit(false);
// display CLIENT_IDENTIFIER, SERVICE, MODULE, ACTION and the 'user calls' statistic
function showSessionInfo(){
  var sql=con.createStatement();
  var res=sql.executeQuery("\
    select client_identifier,service_name,module,action,value \
    from v$session \
    join v$mystat using(sid) \
    join v$statname using(statistic#) \
    where name='user calls' \
  ");
  while(res.next()){
    print();
    print(" CLIENT_IDENTIFIER: "+res.getString(1));
    print(" SERVICE: "+res.getString(2));
    print(" MODULE: "+res.getString(3));
    print(" ACTION: "+res.getString(4));
    print(" User Calls: "+res.getInt(5));
    print();
  }
}
showSessionInfo();
// set client info in the OCSID namespace: no round trip here,
// the values are sent along with the next database call
con.setClientInfo('OCSID.CLIENTID','my Client ID');
con.setClientInfo('OCSID.MODULE','my Module');
con.setClientInfo('OCSID.ACTION','my Action');
showSessionInfo();
// run a statement through the DBLINK:
var sql=con.createStatement();
sql.executeUpdate("call dbms_output.put_line@PDB1@SYSTEM(null)");

I also display the ‘user calls’ from V$MYSTAT. Here is the output:

SQL> .

CLIENT_IDENTIFIER: null
SERVICE: pdb1
MODULE: JDBC Thin Client
ACTION: null
User Calls: 4

CLIENT_IDENTIFIER: my Client ID
SERVICE: pdb1
MODULE: my Module
ACTION: my Action
User Calls: 5

The second execution sees the MODULE, ACTION and CLIENT_IDENTIFIER set by the previous setClientInfo() calls. Most important, the 'user calls' statistic has been incremented by only one, which means that setting them did not add any additional round trip to the database server.

Now, after the call through database link, I display all user sessions from V$SESSION. I can see my SQLcl (java) with nothing set, the JDBC thin session with MODULE, ACTION and CLIENT_IDENTIFIER, and the DBLINK session (connected to SYSTEM) with only the CLIENT_IDENTIFIER set:

SQL> select username,client_identifier,module,action
2 from v$session where type='USER';
  USERNAME   CLIENT_IDENTIFIER                     MODULE       ACTION
__________ ___________________ ________________________ ____________
SYSTEM     my Client ID        oracle@db192
SYS                            java@db192 (TNS V1-V3)
DEMO       my Client ID        my Module                my Action

Following the end-user down to all layers (application, database, remote databases) is great for end-to-end troubleshooting and performance analysis. Set this OCSID.CLIENTID to identify the application (micro-)service and the end-user (like a browser Session ID), for no additional cost, and you will find this information in many performance views:

select table_name, listagg(distinct column_name,', ') 
within group (order by column_name)
from dba_tab_columns
where column_name in ('CLIENT_IDENTIFIER','CLIENT_INFO','CLIENT_ID','MODULE','ACTION')
--and table_name like 'GV%'
group by table_name
order by 2;

You can see how 'CLIENT_INFO' is almost useless (except as an additional level to module/action for SQL Monitor) and how CLIENT_ID(ENTIFIER) is everywhere, including ASH (Active Session History).
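As an example, here's a minimal sketch (assuming the CLIENT ID set in the demo above) that pulls the recent activity of one end-user straight out of ASH:

select sample_time, session_id, sql_id, event
from v$active_session_history
where client_id = 'my Client ID'
order by sample_time;

The same CLIENT_ID column is available in DBA_HIST_ACTIVE_SESS_HISTORY for the AWR retention period.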

With a micro-services architecture, you will have many connections to the database (and don't tell me that each microservice must have its own database: databases were invented decades ago, when streaming data everywhere was an un-maintainable, un-scalable, error-prone mess, and schemas and views were invented to provide exactly these data micro-services within the same database system). The best practice is then to:

  • connect with a dedicated SERVICE_NAME
  • identify the end-user with a CLIENT_ID

and then end-to-end tracing, tuning and troubleshooting will become easy.
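For example (a sketch using the identifier from the demo above), DBMS_MONITOR can switch on SQL trace and statistics aggregation for one end-user across all of that user's sessions:

begin
  -- trace every session that sets this client identifier
  dbms_monitor.client_id_trace_enable(client_id => 'my Client ID', waits => true, binds => false);
  -- aggregate statistics for it (visible in V$CLIENT_STATS)
  dbms_monitor.client_id_stat_enable(client_id => 'my Client ID');
end;
/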

Not Just the How of AD with Linux VM/SQL 2019, but the WHY

Azure Active Directory authentication is available with SQL Server 2019 on Linux in preview. As I was setting it up in my Azure environment on a Red Hat 7.3 Linux VM, I was, as many are, happy that the documentation lists the Azure CLI commands to set up authentication with Azure Active Directory. I was concerned, though, that with so many people new to Linux, the steps don't describe WHY we run certain commands or set best practices around Linux database server design.

The setup expects that you already have a Linux VM with SQL Server 2019 up and running. The first step installs the AAD login extension on the VM so that the AD login can later be set up as the VM Administrator.

az vm extension set \
--publisher Microsoft.Azure.ActiveDirectory.LinuxSSH \
--name AADLoginForLinux \
--resource-group myResourceGroup \
--vm-name myVM

The above command expects you to replace “myResourceGroup” and “myVM” with the correct values for the resource group and Linux VM.

The next step proceeds to run more az commands, but it also expects the DBA or administrator to be familiar with Linux and Linux scripting techniques after NOT using them in the first command. I find this assumption leaves us all open to human error:

https://docs.microsoft.com/en-us/azure/virtual-machines/linux/login-using-aad

Let's talk about variables in an OS environment. In Windows, we can set these at the command line, either for the session or persistently as part of login scripts, and we can also set them in the GUI under My Computer, Advanced settings, Environment Variables. The same can be done in Linux, either at the command line, via login scripts referred to as run-command files (such as .bashrc), or in profiles owned by individual logins.

To set a variable, you simply enter a unique word and assign it a value. Depending on the shell, you may also need to "export" the variable (to make it visible to child processes) or use "set", not just state it:

export <variable>=<value>
set <variable>=<value>

You can then verify the variable is set by using the echo command and calling the variable with a $ sign in front of it:

echo $<variable>

You can just as easily remove a variable with the "unset" command, by overwriting it with a new value for the same variable name, by logging out, or via other profile/run-command scripts.

We can use the same set-and-echo process to make more sense of what is being performed as part of the Linux AD authentication instructions for the VM, too:

username=$(az account show --query user.name --output tsv) 
 
echo $username 
kgorman@microsoft.com
vm=$(az vm show --resource-group SQL2019_grp --name SQL2019RH1 --query id -o tsv)
echo $vm
/subscriptions/00aa000e-xx00x-xxx-x00-x00x-xx00x/resourceGroups/SQL2019_grp/providers/Microsoft.Compute/virtualMachines/SQL2019RH1

Now we can use these variables in the third command and know exactly what is being passed into it, since we were able to push the Azure CLI query results dynamically into the final command:

az role assignment create --role "Virtual Machine Administrator Login" --assignee $username --scope $vm

The output from the command shows that it pulled the original variables into the third command to complete the requirements to build the role assignment. I did a mass replace to protect the IDs, but you get the idea of how to verify that the values were passed in:

{
  "canDelegate": null,
  "id": "/subscriptions/00aa000e-xx00x-xxx-x00-x00x-xx00x/resourceGroups/SQL2019_grp/providers/Microsoft.Compute/virtualMachines/SQL2019RH1/providers/Microsoft.Authorization/roleAssignments/0e00e0b0-bf00-0000-b000-000d0e0fe0d0",
  "name": "0e00e0b0-bf00-0000-b000-000d0e0fe0d0",
  "principalId": "d0c00cba-0c0b-00f0-b00e-d00000000000",
  "resourceGroup": "SQL2019_grp",
  "roleDefinitionId": "/subscriptions/00aa000e-xx00x-xxx-x00-x00x-xx00x/providers/Microsoft.Authorization/roleDefinitions/0c0000c0-00e0-0000-0000-ea0c00e000e0",
  "scope": "/subscriptions/00aa000e-xx00x-xxx-x00-x00x-xx00x/resourceGroups/SQL2019_grp/providers/Microsoft.Compute/virtualMachines/SQL2019RH1",
  "type": "Microsoft.Authorization/roleAssignments"
}

A good update for this doc would be to first explain how and why we set variables, and then use default configurations for the session so that fewer commands have to be entered:

username=$(az account show --query user.name --output tsv) 
az configure --defaults group=SQL2019_grp
az configure --defaults vm=SQL2019RH1
vm=$(az vm show --query id -o tsv)

az vm extension set \
--publisher Microsoft.Azure.ActiveDirectory.LinuxSSH \
--name AADLoginForLinux

The (abbreviated) JSON output confirms the extension deployed successfully:

   "name": "AADLoginForLinux",
   "protectedSettings": null,
   "provisioningState": "Succeeded",
   "publisher": "Microsoft.Azure.ActiveDirectory.LinuxSSH",
   "resourceGroup": "SQL2019_grp",
   "settings": null,
   "tags": null,
   "type": "Microsoft.Compute/virtualMachines/extensions",
   "typeHandlerVersion": "1.0",
   "virtualMachineExtensionType": "AADLoginForLinux"

And in the end, we can check our work to verify we’ve done everything correctly:

az vm list --query '[].{Name:name, OS:storageProfile.osDisk.osType, Admin:osProfile.adminUsername}' --output table

You should see your new AD login now in the list.

This doesn’t take you too far into Linux scripting, but hopefully it answers the WHY you are running the commands that are displayed in the instructions and the order that you’re running them in.




The Late Spring Speaking Gauntlet

There are busy times for everyone, and if you speak at conferences, those busy times are March, May and November. I am recovering from the early spring rush, and now it's time to prepare for the late spring one.

I’ve been fortunate enough to be accepted to speak at the following regional SQL Saturdays and look forward to speaking and meeting new folks, along with catching up with conference friends:

SQL Saturday Raleigh, April 27th

  • Optimizing Power BI, Taming of the Shrew, (Analytics)

SQL Saturday Jacksonville, May 5th

  • GDPR, The Buck Stops Here, (data governance)
  • Be a Part of the Solution with Automation, (DevOps)
  • Women in Technology Panel, (lunch session)

SQL Saturday Atlanta, May 18th

  • Essential Linux Skills for the DBA, (Linux)

That will finish up May, and I'll have announcements for June soon. Until then, I'll just be over here typing furiously on a couple of book chapters and keeping the day gig going…

PL/SQL – Don’t mix and match scope

Here's a simple little PL/SQL block where we call an inner procedure PARAMETER_TESTER from its parent block. Pay particular attention to the parameter we pass to the procedure, and its value throughout the execution of that procedure.


SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15     parameter_tester(100);
 16  end;
 17  /
Param came in as: 100
Param left as   : 100
glob_var is now : 1

PL/SQL procedure successfully completed.

Now I’ll slowly extend the code, and just by eyeballing it, see if you can predict what the output will be before looking past the end of the PL/SQL block.



SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15     local_var := glob_var;
 16     parameter_tester(local_var);
 17  end;
 18  /
Param came in as: 0
Param left as   : 0
glob_var is now : 1

PL/SQL procedure successfully completed.

So far so good I imagine. The parameter came in as zero, we incremented the global variable which of course had no impact on the parameter. Let’s now up the ante a little.



SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15     parameter_tester(glob_var);
 16  end;
 17  /
Param came in as: 0
Param left as   : 1
glob_var is now : 1

PL/SQL procedure successfully completed.

This is perhaps the first one that you might find a little unexpected. Notice that the value of the parameter passed to the procedure changed within the inner procedure, even though it was passed (implicitly) as an IN parameter. People often assume that if you pass anything to a procedure without the IN OUT or OUT specification, then the parameter is "read only" and cannot be touched by code. This is true to the extent that you cannot perform an assignment to that parameter, as you can see below:



SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         param := param + 1;
  9     end;
 10
 11  begin
 12     parameter_tester(glob_var);
 13  end;
 14  /
       param := param + 1;
       *
ERROR at line 8:
ORA-06550: line 8, column 8:
PLS-00363: expression 'PARAM' cannot be used as an assignment target
ORA-06550: line 8, column 8:
PL/SQL: Statement ignored

but that is not the same as saying that the parameter is fixed in value throughout the duration of the call. This behaviour is documented in the PL/SQL language manual: an IN parameter can be passed by reference rather than by value.


Now I’ll explore some other examples of how you might get caught out by this. I’ll modify the example just slightly now so that I’m passing an expression rather than just “glob_var”.



SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15     parameter_tester(glob_var+1);
 16  end;
 17  /
Param came in as: 1
Param left as   : 1
glob_var is now : 1

PL/SQL procedure successfully completed.

Now “normal” service has been resumed, in that the expression is evaluated first and hence is passed by value to the procedure leaving the parameter value unchanged throughout the procedure.

So it would appear that an expression disables the "pass by reference" mechanism? Well, let's try two more examples:



SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15    parameter_tester(to_char(glob_var));
 16  end;
 17  /
Param came in as: 0
Param left as   : 0
glob_var is now : 1

PL/SQL procedure successfully completed.

SQL>
SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15    parameter_tester(to_number(glob_var));
 16  end;
 17  /
Param came in as: 0
Param left as   : 1
glob_var is now : 1

PL/SQL procedure successfully completed.

Notice the last one in particular. Even though I had a TO_NUMBER expression around "glob_var", it was still passed by reference and hence the parameter value changed throughout the execution of the inner procedure. This is because the PL/SQL compiler detected that the TO_NUMBER function was redundant and optimized it out during the compilation process. That left just a parameter input of "glob_var", which could then be passed by reference.

Bottom line – it's generally considered poor programming practice in any language to mix and match the scope of variables; in this case, using glob_var both within and outside the inner procedure. Trying to subvert the behaviour of the PL/SQL engine by using expressions such as to_char(glob_var) is a recipe for disaster. Who knows what additional optimizations the next release of the PL/SQL compiler will have? Perhaps it will optimize out "glob_var+1" or to_char(glob_var) and hence pass them by reference as well. If your functionality really depends on coding with these blurred scopes, then make sure you perform an assignment to a local variable and pass that, to avoid unexpected side-effects.



SQL> set serverout on
SQL> declare
  2
  3     glob_var  int := 0;
  4     local_var int;
  5
  6     procedure PARAMETER_TESTER(param int) is
  7     begin
  8         dbms_output.put_line('Param came in as: '||param);
  9         glob_var := glob_var + 1;
 10         dbms_output.put_line('Param left as   : '||param);
 11         dbms_output.put_line('glob_var is now : '||glob_var);
 12     end;
 13
 14  begin
 15     local_var := glob_var;
 16     parameter_tester(local_var);
 17  end;
 18  /
Param came in as: 0
Param left as   : 0
glob_var is now : 1

PL/SQL procedure successfully completed.

Quick and easy masking

I had a request from a client a while back regarding masking of data. They had an application with sensitive data in the Production environment (where access and audit were very tightly controlled) but the issue was how to respect that sensitivity in non-Production environments whilst still preserving full size data sizes for application testing.

After some conversations about requirements, it turned out that since (even in non-Production environments) all access to application components was logged and audited, the issue was simply protecting against “inadvertent” access to sensitive data. For example, in the application, if I searched for “males with black hair with an interest in technology” I should never see the name “Connor McDonald” on screen in the results, because simply viewing that data could be treated as a breach of privacy.

Addendum: Anyone adding a comment to this post with "Shouldn't that be males with black hair with a hint of grey" will be banned forever :-)

I suggested Data Redaction as a potential solution, but they wanted the data actually obfuscated within the database. Then I suggested Data Masking, which definitely does the job, but they felt it offered far more than their simple requirement: just obfuscate the data while retaining its distribution, to preserve performance characteristics as much as possible.

So ultimately here is the basis of a little routine I created for them to obfuscate their data and satisfy their requirement of not viewing meaningful data in non-Production environments. Please read the footnote at the end of this post before you consider using this code. The logic here is straightforward:

  • Take the alphanumeric characters in a logical order as a single string.
  • Use DBMS_RANDOM to come up with 255 variations of that string and store them in an array.
  • Use the TRANSLATE function to perform a simple obfuscation of the input string.

That translation could then be used during the process of copying data from Production, or as a post-copy task (there's an example of the latter towards the end of this post).

First I’ll demo the concept of coming up with randomised strings. Here’s a simple anonymous block to create 10 random strings based on a base string of: ABC….Zabc…z0123456789


SQL> set serverout on
SQL> declare
  2    type rand_list is table of varchar2(62) index by pls_integer;
  3    source varchar2(62) := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  4    perms  rand_list;
  5    source_variant varchar2(62);
  6  begin
  7    for i in 1 .. 10 loop
  8      source_variant := null;
  9      for j in ( select rownum r from dual connect by level <= 52 order by dbms_random.value )
 10      loop
 11        source_variant := source_variant || substr(source,j.r,1);
 12      end loop;
 13      for j in ( select rownum r from dual connect by level <= 10 order by dbms_random.value )
 14      loop
 15        source_variant := source_variant || substr(source,52+j.r,1);
 16      end loop;
 17      perms(i) := source_variant;
 18      dbms_output.put_line(source_variant);
 19    end loop;
 20  end;
 21  /
xrLCuowbNfvXlMJSBKctPyheZWTnQUjYAdizkFVOIHGsmDRpaEqg5637918402
OkVxApdsLqTlyvgYXQGaSeNPMCjZEJhHRtbcIBrowKUunifzDFmW8013742695
gcVxtNeMdBuRJhvYrnkCIWzoyFwfLpqSATabQGUZKmDOEilPHsXj5601892743
vOUuisITyJjbaQLqfnpFVBoCrYtzHeDAmWExPkNwgScXKRlZGMhd3104879256
VLZkXndmytCWRwKeNgYifvTESzGFhHcqaMJPoljxUsAQrubOpBDI7302189564
VfPLKmObZzBMCpgrjUaQYieRNJykDwtnuWXFxdTEIlScqvohAGsH0138924756
ABMkfozebljnPGqHJhpNwxYEUCuQLyDrOasIgtFZKVmciTXSdvRW5702139864
iNjrmVTlyzFSLswfBvaWKUAHIZOcgMPYXCGqeoDJbhpktERdQunx1976045328
hiYGzxMEZFPcmkutbDlrCeynSITKHNJaXAwfpRsdqLjvQUoWVOgB0479632158
uZpPXOHLAvzricdjtqSQBKVhYoFfmUxEDNCwWlIJsbRgkMeGynaT6451328790

So a simple obfuscation could be:


SQL> select
  2    translate(
  3      'Connor McDonald',
  4      'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
  5      'ehZAFjmtYwWilTRDnCcLrvQqEpakKGPMxXyudVUNsoJOBfzSIbHg2163095784')
  6  from dual;

TRANSLATE('CONN
---------------
ZsNNsO lKAsNaVG

Armed with this, I can create a package which extends this to choose from 255 random strings and use these as a simple obfuscator for source data.


SQL> create or replace
  2  package masker is
  3    function rand_masks(p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return  sys.odcivarchar2list pipelined;
  4    function standard_source return varchar2;
  5    function idx_entry(p_idx int, p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return varchar2;
  6    function mask(p_input varchar2) return varchar2;
  7  end;
  8  /

Package created.

SQL> create or replace
  2  package body masker is
  3    perms sys.odcivarchar2list := sys.odcivarchar2list();
  4
  5  procedure init(p_size int,p_seed int) is
  6    source varchar2(62) := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  7    source_variant varchar2(62);
  8  begin
  9    dbms_random.seed(p_seed);
 10    perms.delete;
 11    perms.extend(p_size);
 12    for i in 1 .. p_size loop
 13      source_variant := null;
 14      for j in ( select rownum r from dual connect by level <= 52 order by dbms_random.value )
 15      loop
 16        source_variant := source_variant || substr(source,j.r,1);
 17      end loop;
 18      for j in ( select rownum r from dual connect by level <= 10 order by dbms_random.value )
 19      loop
 20        source_variant := source_variant || substr(source,52+j.r,1);
 21      end loop;
 22      perms(i) := source_variant;
 23  --      dbms_output.put_line(source_variant);
 24    end loop;
 25  end;
 26
 27  function rand_masks(p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return sys.odcivarchar2list pipelined is
 28  begin
 29    if perms.count < p_size or p_refresh in ('Y','y') then
 30      init(p_size,p_seed);
 31    end if;
 32
 33    for i in 1 .. p_size
 34    loop
 35      pipe row ( perms(i));
 36    end loop;
 37
 38    return;
 39  end;
 40
 41  function standard_source return varchar2 is
 42  begin
 43     return 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
 44  end;
 45
 46  function idx_entry(p_idx int, p_size int default 255, p_refresh varchar2 default 'N',p_seed number default 0) return varchar2 is
 47  begin
 48    if perms.count < p_size or p_refresh in ('Y','y') then
 49      init(p_size,p_seed);
 50    end if;
 51    return perms(p_idx);
 52  end;
 53
 54  function mask(p_input varchar2) return varchar2 is
 55  begin
 56    return translate(p_input,masker.standard_source,masker.idx_entry(ascii(p_input)));
 57  end;
 58
 59  end;
 60  /

Package body created.

Now I’ll test this with some sample data


SQL> create table t ( pk int, n varchar2(50));

Table created.

SQL>
SQL> insert into t values (1,'Connor McDonald');

1 row created.

SQL> insert into t values (2,'Chris Saxon');

1 row created.

SQL> insert into t values (3,'Maria Colgan');

1 row created.

SQL> insert into t values (4,'Bobby Curtis');

1 row created.

SQL> insert into t values (5,'Joel Kallman');

1 row created.

SQL> insert into t values (6,'Steven Feuerstein');

1 row created.

SQL> insert into t values (7,'Connor McDonald');

1 row created.

SQL> select pk, n, masker.mask(n) diddled
  2  from t;

        PK N                                                  DIDDLED
---------- -------------------------------------------------- ------------------------------
         1 Connor McDonald                                    sAJJAe CvnAJjWt
         2 Chris Saxon                                        sweOy RjrAJ
         3 Maria Colgan                                       czEJz BhMbzm
         4 Bobby Curtis                                       nkjjI EpzLBS
         5 Joel Kallman                                       oYfi luiiIuj
         6 Steven Feuerstein                                  CyUrUE SUtUWQyUXE
         7 Connor McDonald                                    sAJJAe CvnAJjWt

7 rows selected.

There we go! A simple obfuscator that runs quite efficiently on source data whilst preserving the distribution of the data.
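Applying it as a post-copy task is then just an update using the package; a minimal sketch against the demo table above (for production-sized tables you would probably want to batch this or use parallel DML):

update t set n = masker.mask(n);

commit;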

Footnote: Remember that any masking scheme that does not completely randomize the obfuscation of source data is not a complete encryption or security solution. Most mechanisms to deduce source data from a manipulated variant of that data involve distribution analysis of letters, n-grams, words and phrases. The moment you preserve the distribution of source data for (say) performance and/or optimizer plan preservation, you are allowing for source data to be derived if there is enough of it available for that analysis. This is not a substitute for appropriate security and encryption protections.

Microsoft Webinar on the Future of the DBA- Certifications

Thanks to those who attended; there were over 1,600 registrations for yesterday's Microsoft webinar on the future of the DBA in the cloud. This was a fun session for me, as I got to demo one of my favorite skill areas since starting at Microsoft: automation of cloud deployments in Azure.

Bash’in it

It's natural that others would be doing this, but I'm writing all my main scripts in BASH, which prepares the environments, calls the Azure CLI commands and then other scripts (PowerShell, JSON templates, etc.) from the main "wrapper" script. I'm also deploying not only the infrastructure or databases, but logical objects and data as part of my deployment, so it's pretty fun as this evolves.

So Many Questions

While the pre-recorded webinar was presented to attendees, I was busy answering questions from them. There were so many that two others and I were barely able to keep up with the sheer onslaught! A significant number of questions were about Azure certifications for DBAs and developers: how to know which certifications to get, and what the path to achieving them is. I'd like to first point out this great URL from Microsoft Learning that can assist:

https://www.microsoft.com/en-us/learning/browse-all-certifications.aspx?certificationtype=role-based

This site, depending on the role and area you are interested in, will create a certification path for the specialist. I think it will provide the answers that many were looking for during the webinar, but if there are more questions around certifications, please don’t hesitate to post in the comments below.

A second set of questions was about what to study if you are going to migrate to Azure. The first recommendation is to become familiar with the SQL Server Migration Assistant. This tool is incredibly robust and versatile. I use it with customers to ease migration woes, not just from SQL Server on-prem to the cloud, but also for Oracle, DB2 and even WID (Windows Internal Database; yes, those are the databases behind Windows servers) migrations.

This tool will provide you with a full review of your current environment and of the definite and potential issues you need to address before you can migrate the DDL over to an Azure DB in the cloud. Once you are happy, it can perform the migration. Once the migration of the DDL is complete, it can identify how much data there is and migrate the data over. As these are single-insert processes, I recommend "unchecking" the few "whale in the pond" tables and migrating those tables manually with a different tool, such as AzCopy via blob storage, Azure Fast Data Transfer, or even a BCP process with blob storage.

Skills

Next, on professional skills outside of certification: much of Azure runs on Linux and, unlike on other database platforms that have always run on Linux, you may not have a Linux admin who can support you, so you may need to know more starting out. Consider learning Linux administration fundamentals, not just how to manage a database on Linux.

For language skills, learn BASH and Python. Both languages are going to support you for the long haul. That's not to say your PowerShell scripts aren't going to be transitioned over to Linux hosts. Start working with the Azure Cloud Shell now, creating cloud storage to migrate your scripts over, and update them to support Azure. Learning BASH, as it is a more mature scripting language (it's just older, and a ton of other Unix scripting languages have fed into it…), will make you more effective.

Python is taking the world by storm, not just in data science but in DevOps, too. I do feel that DBAs will be more involved in DevOps in the future and should acquire the skills that make them more valuable to these teams.

I hope the following links help get you started, and thank you again for attending the webinar. We'll hopefully have another one very soon!

Azure Training from Microsoft Learning

Python edx classes

edx Introduction to Linux

Linux Certification from Microsoft Learning




In-table predicates

This note was prompted by a recent email asking about the optimizer’s method for estimating the selectivity of a predicate which compared two columns in the same table – for example:  “where orders.amount_invoiced = orders.amount_paid”. It’s been about 14 years since I wrote “Cost Based Oracle – Fundamentals” so my memory of what I wrote (and whether I even mentioned this case) was rather hazy, so I sent off a quick reply and decided to do a little checking.

It turned out that I'd already written a blog note with a throwaway comment about the estimates and a general workaround for optimizer problems caused by examples of this kind. The comment I made about the estimate was that the selectivity seems to be the smaller of the selectivities of (using the example above) "amount_paid = :unpeekable_bind" and "amount_invoiced = :unpeekable_bind". I'm fairly sure I've made similar comments several times in the past, but after replying to the email I started to wonder whether this would still be true if there were histograms on the columns. So I ran up a little test and here, to start things off, is the code to generate the data I used for testing:


rem
rem     Script:         column_equality_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2019
rem     Purpose:

create table t1(
        id      number(8,0),
        n1      number(6,0)
)
;

create table t2(
        id      number(8,0),
        n1      number(6,0)
)
;

create table t3(
        n1      number(6,0),
        n2      number(6,0),
        v1      varchar2(50)
)
;

execute dbms_random.seed(0)

insert into t1
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        trunc(10 * abs(dbms_random.normal))     n1
from
        generator       v1
;

insert into t2
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        trunc(10 * abs(dbms_random.normal))     n1
from
        generator       v1
;

insert into t3 (n1, n2, v1)
select
        t1.n1,
        t2.n1,
        rpad(rownum,50)
from
        t1, t2
where
        t1.id = t2.id
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T3',
                method_opt  => 'for all columns size 254 for columns v1 size 1'
        );
end;
/

select
        table_name, column_name, num_distinct, density, histogram,
        low_value, high_value
from
        user_tab_cols
where
        table_name in ('T1','T2','T3')
and     column_name in ('N1','N2')
order by
        table_name, column_name
;


TABLE_NAME      COLUMN_NAME     NUM_DISTINCT    DENSITY HISTOGRAM       LOW_VALUE  HIGH_VALUE
--------------- --------------- ------------ ---------- --------------- ---------- ----------
T1              N1                        38     .00005 FREQUENCY       80         C128

T2              N1                        38     .00005 FREQUENCY       80         C126

T3              N1                        38     .00005 FREQUENCY       80         C128
                N2                        38     .00005 FREQUENCY       80         C126


I’ve created two sets of 10,000 rows each of normally distributed data – but taken the absolute values so I’ve only got half the bell curve, and I’ve scaled up by a factor of 10 and truncated. This has given me two similar but slightly different sets of values which happen to cover 38 distinct values each.

I’ve then generated my test set by joining these two tables on the unique (though not declared as such) id column to give a table with the same number of rows and two skewed sets of data. The calls to dbms_stats create histograms on the skewed data sets, and I’ve reported a few significant numbers about the 4 relevant columns.

Looking at the column statistics we have num_distinct = 38 across the board – so my observation from paragraph 2 above would tend to suggest that the optimizer would report 10,000/38 = 263 as the cardinality estimate for the predicate "t3.n1 = t3.n2" (I'm fairly confident that in this case 1/num_distinct will be preferred over using the density from user_tab_cols). But here's what we get from a call to explain plan:


explain plan for
select
        v1
from
        t3
where
        n1 = n2
;

select * from table(dbms_xplan.display);

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   564 | 32148 |    18   (6)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T3   |   564 | 32148 |    18   (6)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"="N2")

The estimate is 564 – which is a pretty good estimate in this case (the actual result was 552) as the two columns were randomly generated and there's no correlation between them. Unfortunately this is quite a long way off my assumption of 263, so where did the optimizer get that number from?

Here’s a query (with result set) that you may recognise from an earlier post.


break on report skip 1
compute count of value on report
compute sum of t1_frequency on report
compute sum of t2_frequency on report
compute sum of product on report

column product format 999,999,999

with f1 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T3'
and     column_name = 'N1'
),
f2 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T3'
and     column_name = 'N2'
)
select
        f1.value,
        f1.frequency    t1_frequency,
        f2.frequency    t2_frequency,
        f1.frequency * f2.frequency product
from
        f1, f2
where
        f2.value = f1.value
order by
        f1.value
;



     VALUE T1_FREQUENCY T2_FREQUENCY      PRODUCT
---------- ------------ ------------ ------------
         0          777          768      596,736
         1          806          753      606,918
         2          794          779      618,526
         3          808          763      616,504
         4          752          749      563,248
         5          627          729      457,083
         6          623          628      391,244
         7          584          616      359,744
         8          544          597      324,768
         9          512          546      279,552
        10          441          439      193,599
        11          409          342      139,878
        12          345          370      127,650
        13          318          300       95,400
        14          257          282       72,474
        15          244          242       59,048
        16          214          206       44,084
        17          172          193       33,196
        18          161          140       22,540
        19          113          114       12,882
        20          108           93       10,044
        21           95           81        7,695
        22           72           55        3,960
        23           54           56        3,024
        24           43           36        1,548
        25           38           31        1,178
        26           23           18          414
        27           18           23          414
        28            7           14           98
        29            9           13          117
        30           14           11          154
        31            4            2            8
        32            5            3           15
        33            1            3            3
        35            4            1            4
        37            2            2            4
---------- ------------ ------------ ------------
        36
                   9998         9998    5,643,754


I'm querying the histogram information for the two columns, and where t3.n1 and t3.n2 have a value in common I've reported the two frequencies for that value and the product of the frequencies. For convenience I've included a count and a couple of sums to show that there isn't a perfect match in the set of values for the two columns. The most important number at the bottom of the page, though, is the sum of the products of frequencies of common values. Take that value and divide by 10,000 and you get 564.3754 – compare that with the cardinality estimate for the predicate "t3.n1 = t3.n2" and it's a perfect match (allowing for rounding).

The query against user_tab_histograms is the query I used to calculate the cardinality of a join where there were frequency histograms on the columns at both ends of the join. The optimizer’s estimate for “intra-table” predicates is consistent with its estimate for joins (in the special cases of “no histograms” and “two frequency histograms”, at least). Viewing it from a slightly different angle: the selectivity of the predicate “n1 = n2” can be derived as “the cardinality estimate for joining t3 to itself” divided by “the cardinality of the cartesian join” (the latter being num_rows * num_rows, of course).
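To make that arithmetic explicit, here's a sketch (re-using the f1/f2 subqueries from above) that divides the sum of the frequency products by num_rows to reproduce the optimizer's estimate directly from the histograms; it should return 564:


with f1 as (
        select
                endpoint_value                                                            value,
                endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
        from    user_tab_histograms
        where   table_name  = 'T3'
        and     column_name = 'N1'
),
f2 as (
        select
                endpoint_value                                                            value,
                endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
        from    user_tab_histograms
        where   table_name  = 'T3'
        and     column_name = 'N2'
)
select
        round(
                sum(f1.frequency * f2.frequency) /
                (select num_rows from user_tables where table_name = 'T3')
        )       estimated_card
from
        f1, f2
where
        f2.value = f1.value
;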

Just as a closing demo – let's generate a plan for the appropriate self-join of t3 and check the cardinality estimate:


explain plan for
select
        t3a.v1, t3b.v1
from
        t3 t3a, t3 t3b
where
        t3a.n2 = t3b.n1
;

select * from table(dbms_xplan.display);


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  5643K|   581M|   138  (83)| 00:00:01 |
|*  1 |  HASH JOIN         |      |  5643K|   581M|   138  (83)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T3   | 10000 |   527K|    13   (8)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T3   | 10000 |   527K|    13   (8)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T3A"."N2"="T3B"."N1")


As expected the (rounded) join cardinality is reported as 5,643K.

So the selectivity of the single table predicate "n1 = n2" will be 5,643,754 / (10,000 * 10,000) = 0.05643754, and the cardinality estimate of the single table query will be 10,000 * 0.05643754 = 564.3754, which rounds to the 564 we saw. QED.

I haven’t tested any other variations of types of histogram, degree of overlap of value ranges, etc. but I suspect that the general principle is probably going to give the selectivity as (or with the appearance of): “estimated cardinality of self-join” / “square of num_rows (allowing for nulls)”.

 

APEX Upgrade redux

I posted about my APEX upgrade to 19 yesterday, and someone was quick to point out to me that they believed I hadn’t covered all of the steps.

“What if your APEX instance needs to call web services?” they said. “You need to update your Access Control Lists.”

I hadn’t thought of that, so I logged onto one of my other APEX instances that was still at version 18, and checked the current ACLs



SYS> select acl,
  2         principal,
  3         privilege
  4  from   dba_network_acl_privileges
  5  order by acl, principal, privilege;

ACL                                                PRINCIPAL            PRIVILEGE
-------------------------------------------------- -------------------- ----------
/sys/acls/oracle-sysman-ocm-Resolve-Access.xml     ORACLE_OCM           resolve
NETWORK_ACL_192DBA180AEB40AD98A73ACCD309FF8F       APEX_180200          http
NETWORK_ACL_296C00CF7F2744BAB526D4C4E85FE189       GGSYS                resolve
NETWORK_ACL_296C00CF7F2744BAB526D4C4E85FE189       GSMADMIN_INTERNAL    resolve

4 rows selected.

I can see the potential issue here. On upgrade, I’ll have a new schema which would need the same ACLs granted as the existing APEX 18 schema. Then I went ahead and upgraded this instance to version 19, and lo and behold, check out this nice little touch in the upgrade.


SYS> select acl,
  2         principal,
  3         privilege
  4  from   dba_network_acl_privileges
  5  order by acl, principal, privilege;

ACL                                                PRINCIPAL            PRIVILEGE
-------------------------------------------------- -------------------- ----------
/sys/acls/oracle-sysman-ocm-Resolve-Access.xml     ORACLE_OCM           resolve
NETWORK_ACL_0F93A8F653EC43DC9D90457B1151A330       APEX_190100          http
NETWORK_ACL_192DBA180AEB40AD98A73ACCD309FF8F       APEX_180200          http
NETWORK_ACL_296C00CF7F2744BAB526D4C4E85FE189       GGSYS                resolve
NETWORK_ACL_296C00CF7F2744BAB526D4C4E85FE189       GSMADMIN_INTERNAL    resolve

5 rows selected.

The upgrade took care of the ACLs for me! That’s pretty cool.
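Had the upgrade not done this, the manual fix would have been to append an ACE for the new APEX schema yourself. A sketch along these lines (assuming the same "http" privilege on any host; adjust the host and ports to match your environment):

begin
  dbms_network_acl_admin.append_host_ace(
    host => '*',
    ace  => xs$ace_type(
              privilege_list => xs$name_list('http'),
              principal_name => 'APEX_190100',
              principal_type => xs_acl.ptype_db));
  commit;
end;
/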

Ansible tips’n’tricks: provision multiple machines in parallel with Vagrant and Ansible

Vagrant is a great tool that I'm regularly using for building playground environments on my laptop. I recently came across a slight inconvenience with Vagrant's VirtualBox provider: occasionally I would like to spin up a Data Guard environment and provision both VMs in parallel to save time. Sadly, according to the documentation, you can't bring up multiple machines in parallel using the VirtualBox provider. This was true as of April 11, 2019 and might change in the future, so keep an eye on the reference.

I very much prefer to save time by doing things in parallel, and so I started digging around how I could achieve this goal.

The official documentation mentions something that looks like a for loop to wait for all machines to come up. This isn't really an option because I wanted more control over machine names and IP addresses. So I came up with this approach; it may not be the best, but it falls into the "good enough for me" category.

Vagrantfile

The Vagrantfile is actually quite simple and might remind you of a previous article:

  1 Vagrant.configure("2") do |config|
  2   config.ssh.private_key_path = "/path/to/key"
  3 
  4   config.vm.define "server1" do |server1|
  5     server1.vm.box = "ansibletestbase"
  6     server1.vm.hostname = "server1"
  7     server1.vm.network "private_network", ip: "192.168.56.11"
  8     server1.vm.synced_folder "/path/to/stuff", "/mnt",
  9       mount_options: ["uid=54321", "gid=54321"]
 10 
 11     config.vm.provider "virtualbox" do |vb|
 12       vb.memory = 2048
 13       vb.cpus = 2
 14     end
 15   end
 16 
 17   config.vm.define "server2" do |server2|
 18     server2.vm.box = "ansibletestbase"
 19     server2.vm.hostname = "server2"
 20     server2.vm.network "private_network", ip: "192.168.56.12"
 21     server2.vm.synced_folder "/path/to/stuff", "/mnt",
 22       mount_options: ["uid=54321", "gid=54321"]
 23 
 24     config.vm.provider "virtualbox" do |vb|
 25       vb.memory = 2048
 26       vb.cpus = 2
 27     end
 28   end
 29 
 30   config.vm.provision "ansible" do |ansible|
 31     ansible.playbook = "hello.yml"
 32     ansible.groups = {
 33       "oracle_si" => ["server[1:2]"],
 34       "oracle_si:vars" => { 
 35         "install_rdbms" => "true",
 36         "patch_rdbms" => "true",
 37       }
 38     }
 39   end
 40 
 41 end

Ansibletestbase is my custom Oracle Linux 7 image that I keep updated for personal use. I define a couple of machines, server1 and server2 and from line 30 onwards let Ansible provision them.

A little bit of an inconvenience

Now here is the inconvenient bit: if I provided an elaborate playbook to provision Oracle in line 31 of the Vagrantfile, it would be run serially – first for server1, and only after it completed (or failed…) would server2 be created and provisioned. This is the reason for a rather plain playbook, hello.yml:

$ cat hello.yml 
---
- hosts: oracle_si
  tasks:
  - name: say hello
    debug: var=ansible_hostname

This literally takes no time at all to execute, so no harm is done running it serially once per VM. Not only is no harm done, quite the contrary: Vagrant discovered the Ansible provisioner in the Vagrantfile and created a suitable inventory file for me. I'll gladly use it later.

How does this work out?

Enough talking, time to put this to the test and bring up both machines. As you will see in the captured output, Vagrant starts them one by one, runs the provisioning tool, and then proceeds to the next system.

$ vagrant up 
Bringing machine 'server1' up with 'virtualbox' provider...
Bringing machine 'server2' up with 'virtualbox' provider...
==> server1: Importing base box 'ansibletestbase'...
==> server1: Matching MAC address for NAT networking...

[...]

==> server1: Running provisioner: ansible...

[...]

    server1: Running ansible-playbook...

PLAY [oracle_si] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [server1]

TASK [say hello] ***************************************************************
ok: [server1] => {
    "ansible_hostname": "server1"
}

PLAY RECAP *********************************************************************
server1                    : ok=2    changed=0    unreachable=0    failed=0   

==> server2: Importing base box 'ansibletestbase'...
==> server2: Matching MAC address for NAT networking...

[...]

==> server2: Running provisioner: ansible...

[...]
    server2: Running ansible-playbook...

PLAY [oracle_si] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [server2]

TASK [say hello] ***************************************************************
ok: [server2] => {
    "ansible_hostname": "server2"
}

PLAY RECAP *********************************************************************
server2                    : ok=2    changed=0    unreachable=0    failed=0

As always, the Ansible provisioner created an inventory file I can use in ./.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory. The inventory looks exactly as described in the |ansible| block, and it has the all-important group variables as well.

$ cat ./.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory
# Generated by Vagrant
server2 ansible_host=127.0.0.1 ansible_port=2201 ansible_user='vagrant' ansible_ssh_private_key_file='/path/to/key'
server1 ansible_host=127.0.0.1 ansible_port=2200 ansible_user='vagrant' ansible_ssh_private_key_file='/path/to/key'

[oracle_si]
server[1:2]

[oracle_si:vars]
install_rdbms=true
patch_rdbms=true

After ensuring that both machines are up using the "ping" module, I can run the actual playbook. You might have to confirm the servers' ssh keys the first time you run this:

$ ansible -i ./.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory -m ping oracle_si
server1 | SUCCESS => {
"changed": false,
"ping": "pong"
}
server2 | SUCCESS => {
"changed": false,
"ping": "pong"
}

All good to go! Let’s call the actual playbook to provision my machines.

$ ansible-playbook -i ./.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory provisioning/oracle.yml 

TASK [Gathering Facts] ***************************************************************
ok: [server1]
ok: [server2]

TASK [test for Oracle Linux 7] *******************************************************
skipping: [server1]
skipping: [server2]

And we’re off to the races. Happy automating!