When punting data around inside our magical machine, one may wonder how to keep track of it all. Some will eschew ELT because it boils down to a pile of SQL statements, and it sometimes feels out of control. Control of course, is what we make of it. Even a well-defined development product is no match for someone who doesn't like controls.
However, we know this really does boil down to insert/select combinations like so:
Insert into Mytarget (
column list here
)
select
yet another column list
from some tables using complex join and filter logic
It seems we have a handle on the top and bottom, but the "select" clause is where the primary transform and integrations are applied. Things can get really ugly here, especially if we're moving from one legacy platform to another. Our select-statements will look very hairy, indeed.
The "insert" clause is largely along for the ride.
Now it doesn't seem likely that this could get out of control until we're presented with tables that contain, say, a hundred columns. Or even fifty, or say twenty-five. Just enough you see, to keep them from appearing on the same editor page. We might want to add a column to the mix. Hey, add it down below in the select - and make sure you add it in the top to the insert - and don't get anything out of order! And what of the columns are misalgned - data corruptions are a higher-than-everage danger here.
It feels a little primitive, but all we really need is some assistance on the source-to-target mapping and we're good to go. It's impractical to do a source-to-target with unweildy insert/select statements, so let's apply a little Netezza magic. Now, considering that the cost of an ELT statement can sometimes run into minutes of execution time, sacrificing a few extra seconds up front, just to support our weary eyes.
Let's say we automate the scenario a little bit. I have a table called customers that I want to roll together from our old customers and customer-properties tables. The target table is a reporting table with denormalized stuff to support our ad-hoc folks with a lot of pre-calculated goodies. Once we have the calculations, we want to put them into business visibility.
insert into Customers (
customer_id,
first_name,
last_name,
most_recent_purchase_dt,
total_purchases_ctr
lots of other columns here
)
select
c_id,
f_name,
l_name,
max(b.purchase_dt),
sum(b.daily_purchase_ctr),
lots of other columns here
from
old_customer_table a,
old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;
We can readily see that this very typical SQL statement is doing some heavy lifting for us, just like we want it to do inside the machine. But what if the inser/select clauses had a lot more columns? It wouldn't take much for this to feel nervous about its maintainability. What if we have to interate another table to the mix? Left outer joins? The Select clause has pretty much unlimited potential for complexity.
ANSI SQL supports aliases, so let's run with that. We have our source columns in the Select and the Target columns in the Insert, so let's align them thusly (I'll just use the first few for brevity)
select
c_id customer_id,
f_name first_name,
l_name last_name,
max(b.purchase_dt) most_recent_purchase_dt,
sum(b.daily_purchase_ctr) total_purchases_ctr,
$AUDIT_ID audit_id
from
old_customer_table a,
old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;
And lo, we have the makings of a source-to-target map. Don't we? Of course - the Insert-columns appear on the right, ready to functionally redefine the souce values on the left. We do this all the time, don't we? But largely for spontaneous reports and the like. Let's look a little further, because having something like this in open-text doesn't really benefit us.
By circumscribing it with a "cat" we can gain two major benefits without sacrificing clarity - one is the ability to put the SQL statement into a place where we can use it, and one is to provide a means to resolve any $ variables that happen to be in the SQL statement. Note the use of the AUDIT_ID variable.
MY_SELECT=$( cat <<!
select
c_id customer_id,
f_name first_name,
l_name last_name,
max(b.purchase_dt) most_recent_purchase_dt,
sum(b.daily_purchase_ctr) total_purchases_ctr,
$AUDIT_ID audit_id
from
old_customer_table a,
old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;
!
)
Okay, now we have some options - so let's try this:
nzsql -a <<!
$MY_SELECT limit 0 ;
!
this lets us test the SQL statement, but only to make sure we formatted it right. Now let's do something more useful:
nzsql -a <<!
create table temp_target as $MY_SELECT limit 0 ;
!
now we have a persistent table in catalog, with correctly named and sequenced columns that align with the select statement. Note that the columns on the catalog will also have expected data types, which we could check against the target table's data types for consistency, but for now we just need something that the system will accept without complaining.
I'm a big fan of letting the Netezza environment do the heavy lifting. We could set up a parsing function to rip through our SELECT statement and find the alias'd column names, but this will fall apart with the more complex SQL statements. We already have a highpowered SQL parser at our disposal, don't we? And doesn't the CTAS have a thousand-and-one uses, after all?
Let's do a CTAS like the above - with "limit 0" - meaning that it won't do any real processing work, but will give us the power of its parsing engine to find the target columns with the added benefit of registering them by name and in the proper order - but to a temporary table
Now let's put the CTAS together with a way to pull the columns off the catalog - I'm throwing this to a flat file for debugging, but you probably know how to stream this directly into a loop - to follow
nzsql -A -t -o outputfile.txt <<!
create temp table temp_target as $MY_SELECT limit 0 ;
select attname from _v_relation_column where name=upper('temp_target');
EOF
Now let's pull this file into a quick loop,
M_SEP=""
foreach line in outputfile,txt
do
INSERT_STR=$INSERT_STR $M_SEP $line
M_SEP=","
done
or how about
INSERT_STR=$( nzsql -q -A -t <<!
create temp table my_temp as $MY_SELECT limit 0;
select
case when attnum = 1 then ''
else ',' end ||
attname
from _v_relation_column where name = upper('temp_table') order by attnum ;
!
INSERT_STR="insert into TARGET_TBL ("${INSERT_STR}")"
and form it into a string INSERT_STR that looks like this:
customer_id
, first_name
, last_name
, most_recent_purchase_dt,
, total_purchases_ctr
Now what? Execute the the Insert?
nzsql -a <<!
$INSERT_STR $MY_SELECT ;
!
------------------------------------------------
If we put the above activities into a bash function call we would find a setup like this:
nz_insert_from_select()
{
put all the above activities in here
}
-------------------------------------------------------------------------------------------------------------------
So here is what we would implement for any given ELT - we get a visual source-to-target map
MY_SELECT=$( cat <<!
select
c_id customer_id,
f_name first_name,
l_name last_name,
max(b.purchase_dt) most_recent_purchase_dt,
sum(b.daily_purchase_ctr) total_purchases_ctr,
$AUDIT_ID audit_id
from
old_customer_table a,
old_properties_table b
where a.customer_id = b.customer_id
group by a.customer_id, l_name, f_name, etc;
!
)
mret=$( nz_insert_from_select target_table "$MY_SELECT" )
------------------------------------------------------------------------------------
So in ELT space, one of the keys is to balance how much we need to program versus how much is already programmed for us - in the Netezza parsing engine for starters. Catalog-hits are inconsequential when compared to the functional benefit we achieve, and the visually-aligned columns names even for very large tables. We can then add or delete columns from the ELT by adding or deleting lines in the Select. We dont have to align the columns on the top (Insert) and bottom (Select) because they are side-by side - and we know exactly what is going where.