Search

Top 60 Oracle Blogs

Recent comments

Faster DISTINCT operations in 19c

If I gave you a telephone book and asked you to tell me how many distinct street names are present in the book, then the most likely thing you would do is …Wave your mobile phone at me and ask what a “telephone book” is Smile. But assuming you’re old enough to remember telephone books, you’ll probably tackle the task by sorting the entire set in street name order and then work your way down the list and incrementing your count of distinct street names every time it changed. 

In database terms, a little demo of that approach is as below



SQL> create table telephone_book
  2  as select 'Street'||trunc(dbms_random.value(1,30)) street
  3  from dual
  4  connect by level  select street
  2  from   telephone_book
  3  order by 1;

STREET
----------------------------------------------
Street1             == 1st street
Street1
Street10            == 2nd street 
Street10
Street11            == 3rd street 
Street11
Street11
Street11
Street11
Street11
...
...
Street8
Street9             == 28th street
Street9
Street9
Street9

100 rows selected.

which results in our 28 distinct streets.

But what if we were not allowed to sort the data. Perhaps a more realistic question is – what if it was ridiculously prohibitive in terms of time and effort to sort the data? I don’t know about you, but I have better things to do with my weekend than sort the telephone book! Smile

From my demo above, I can see that the highest possible number of distinct streets I could have is 30. So rather than sort the data, I can create a simple 30-character string, which each character represents the occurrence the partnering street.  My string would start off as (the hyphens added for clarity)

NNNNNNNNNN-NNNNNNNNNN-NNNNNNNNNN

and as I read the telephone book, I simply change the string for each street I encounter. If the street for the first row is “Street7”, I change my string to be:

NNNNNNYNNN-NNNNNNNNNN-NNNNNNNNNN

If the second row is “Street22” then the string becomes:

NNNNNNYNNN-NNNNNNNNNN-NYNNNNNNNN

and so on until I have read the entire table. At the end of the exercise, I just count the number of “Y” and that is the distinct count of streets. I never had to sort the data. Here’s a simple code demo of that process:


SQL> set serverout on
SQL> declare
  2    bits varchar2(32) := rpad('N',32,'N');
  3    street_no int;
  4  begin
  5    for i in ( select rownum idx, t.* from telephone_book t )
  6    loop
  7      street_no := to_number(ltrim(i.street,'Street'));
  8      bits := substr(bits,1,street_no-1)||'Y'||substr(bits,street_no+1);
  9      dbms_output.put_line('After row '||i.idx||' map='||bits);
 10    end loop;
 11    dbms_output.put_line('Final='||bits);
 12    dbms_output.put_line('ndv='||(length(bits)-length(replace(bits,'Y'))));
 13  end;
 14  /
After row 1 map=NNNNNNNNNNNNNNNNNNNNNYNNNNNNNNNN
After row 2 map=NNNNNNNNNNNNNNNNNNYNNYNNNNNNNNNN
After row 3 map=NNNNNNNNYNNNNNNNNNYNNYNNNNNNNNNN
After row 4 map=NNNNNNNNYNNNNYNNNNYNNYNNNNNNNNNN
...
...
After row 97 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 98 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 99 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
After row 100 map=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
Final=YYYYYYYYYYYYYYYYYNYYYYYYYYYYYNNN
ndv=28

PL/SQL procedure successfully completed.

and we can verify the result with a standard SQL query:


SQL> select count(distinct street)
  2  from   telephone_book;

COUNT(DISTINCTSTREET)
---------------------
                   28

Of course, such a simple solution masks a lot of complexity in implementing something like this for an arbitrary set of data.

  • How do we know the upper limit on the potential number of distinct rows?
  • How do we rapidly count the number of “Y” or “hits” once we have scanned the data?
  • Have we just shifted the problem to an enormous memory structure?

You’ll be reassured to know that a lot of thought has gone into tackling these issues, and thus taking the simple demo above into a genuine robust implementation within version 19c of the database. Many queries requiring distinct counts can be achieved now without requiring an expensive sort.

Here’s the video walking through exactly what we’ve done in 19c, and what some of the benefits are: