Parsing freeform data in flat files

SQL loader is a very cool utility that has existed for a long time within Oracle to load flat files into the database. However sometimes people find the control file syntax quite cryptic, and when it comes to passing very complicated structures, this can mean control files which are hard to maintain. For me the best solution here is to use an external table. That way we can combine the power of the SQL Loader control file syntax embedded within the external table definition, along with the full power of PL/SQL and SQL for additional parsing of that data.

Here is an example where the data is spread across multiple lines and the task is to bring all that data together into a natural form, namely an ID followed by text.

So here is my file that has free format text


10,some data
that is really 
annoying because it
spans lots of rows.
20,and some more than does not.
30,and a mixture of stuff and we 
will assume that numerics then comm as 
is the thing that tells us it is a new line. 
40,otherwise we will keep on going.

What I’m going to do is use an external table to bring that data in as it is within the file, and then use a PL/SQL function in pipelined mode to parse the data into the component attributes.


SQL>
SQL> create table freeform
  2  (
  3  line varchar2(200)
  4  )
  5  ORGANIZATION external
  6  (
  7  TYPE oracle_loader
  8  DEFAULT DIRECTORY temp
  9  ACCESS PARAMETERS
 10  (
 11  RECORDS DELIMITED BY NEWLINE
 12  FIELDS LDRTRIM
 13  (
 14  line char(200)
 15  )
 16  )
 17  location
 18  (
 19  'freeform.dat'
 20  )
 21  )REJECT LIMIT UNLIMITED ;

Table created.

SQL>
SQL> select * from freeform;

LINE
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10,some data
that is really
annoying because it
spans lots of rows.
20,and some more than does not.
30,and a mixture of stuff and we
will assume that numerics then comm as
is the thing that tells us it is a new line.
40,otherwise we will keep on going.

9 rows selected.

SQL>
SQL> create or replace
  2  type output_data as object (id int, str varchar2(500))
  3  /

Type created.

SQL>
SQL> create or replace
  2  type output_list as table of output_data
  3  /

Type created.

SQL>
SQL> create or replace
  2  function cleanup_that_sucker(rc sys_refcursor) return output_list pipelined is
  3    type strlist is table of varchar2(200) index by pls_integer;
  4    s strlist;
  5    o output_data := output_data(null,null);
  6    c pls_integer := 0;
  7  begin
  8    loop
  9      fetch rc bulk collect into s limit 500;
 10
 11      for i in 1 .. s.count loop
 12        c := c + 1;
 13        if regexp_like(s(i),'^[[:digit:]]+,') then
 14          if c > 1 then
 15             pipe row (o);
 16          end if;
 17          o.id  := substr(s(i),1,instr(s(i),',')-1);
 18          o.str := substr(s(i),instr(s(i),',')+1);
 19        else
 20          o.str := o.str || ' ' || s(i);
 21        end if;
 22      end loop;
 23      exit when rc%notfound;
 24    end loop;
 25    pipe row (o);
 26    return;
 27  end;
 28  /

Function created.

SQL> sho err
No errors.
SQL>
SQL> select *
  2  from cleanup_that_sucker(cursor(
  3          select line from freeform
  4          ));

        ID STR
---------- ------------------------------------------------------------------------------------------------------------------------
        10 some data that is really annoying because it spans lots of rows.
        20 and some more than does not.
        30 and a mixture of stuff and we will assume that numerics then comm as is the thing that tells us it is a new line.
        40 otherwise we will keep on going.

4 rows selected.

SQL>
SQL>

And there we have it. We have the full power of PL/SQL and SQL at our fingertips even though we are parsing flat file data. This avoids complicated control file syntax and makes ongoing maintenance far easier. Performance is still good because we are using the bulk collect feature in PL/SQL. Also because we are taking a cursor expression as input, we have lots of flexibility in terms of what data are we passing in. We simply change the SQL expression in the cursor.