One of the most significant challenges of ELT-based processing is the need for housekeeping infrastructure. I mean, we will find ourselves needing temporary tables, and that's okay. But we'll also need persistent temporary tables - that is - tables we create as processing resources in context of a given set of operations, that we might keep coming back to. Or that we might want for troubleshooting. We have to admit, a truly temporary table that evaporates at the end of the session is handy for housekeeping but lousy for troubleshooting. In many ways, the "temp table" should be a means to organize our immediate thoughts, like de-duping a resource list or whatnot. Utility-stuff makes them handy. But when we need to debug intermediate results, we need a persistent table. And then we need to ditch it, with some rules.
If we run things in shell-script, whether inside the Netezza Linux host or on a companion Linux box, we'll almost always need temporary files as well. Once we create these, we need a way to get rid of them. And upon creating any of these resources, we need a systematic way to keep them safe, meaning a completely unique way of identifying them. For temporary files, we have many options to timestamp their file names, but for a database table this can be somewhat daunting, considering that we could litter our database with lots of orphaned tables in no time flat. The last thing an operator wants to do is go into a littered database and clean out the trash.
The latest release of the Netezza environment provides for stored procedures, which will formalize an invitation (for some) to black-box everything inside the black box. Do not fall into this trap. The black box is only colored that way, it is not intended to be used that way. In the end, we might have componentized stored procedures that do a lot of handy stuff for us, but if their activities are too hidden, we'll hear the same complaints from operators as we hear now when someone violates Rule #10 and performs bulk-data-processing inside their RDBMS, usually with a stored proc. Don't go there. Use the stored procs in peace, but be kind to your operators. Then they won't call you at midnight for answers.
So it goes with any operational scenario, but if we embrace some simple housekeeping hooks in a frameworked sense, we can avoid all these woes and pitfalls. The clear objective is to get more business functionality and solutions "under air" - that is - let the black box do what it does best - crunch and munch the data at high volumes.
A framework is a systematic way to startup a process, provide common resources for the process, allow the process to consume and leverage the resources as a foundation, and when the process completes, the resources are torn-down and tossed. This allows a given process to simply request a resource from the framework, with the expectation of delivery of course, without having to worry about giving it back, tearing it down or anything else in a housekeeping sense.
However, just to keep the operators happy, there is yet another practical means to avoid this headache, and this is to provide more than one database to support the data flow. Many of you are aware of the simplest form of this, where we have an intake database, a workspace database (for transforms) and a target database such as a repository or mart.
Staging -> Transforms -> Reposit
What this means is that we can intake data from any number of sources and assume that the information is in an unknown or dirty state. We can then apply large-scale transformation (joins, rollups, etc) to the data in massively parallel form. Once completely done, we commit our work to the target with simple table copies. We can afford a table copy in Netezza because of its ability to move data rapidly at the SPU level. Given the right distribution, copying data is a minor penalty when we see what we get back for it:
The ability to control the transformation process in a safe zone so that if it fails or corrupts, it is never committed to the repository. We use this technique in ETL all the time, pull data from a source, transform it and prepare it for insert, then commit it. We need to embrace this for ELT as well, because we need to protect the target from corruption with intermediate results from an incomplete operation.
Once we assume the need for such a backbone, we don't really have a lot of high-functionality to support it. After all, if shell-script can help this along, I suppose any of you could provide a frameworked model in Perl, Java or .NET. But might not want to launch an entire development environment just to support these simple activities. They seem simple because they are. Shell script is good because it is inherently a control language and does not tempt us to do a lot of programming. Since we don't need a lot of programming, this is good.
When we launch an ELT framework, we need a number of different items to provide context. Once is a reference to the aforementioned databases. Understand also that if we have a database that is for our personal development use, it can behave as all three. Once we move toward integration, we would break apart the references and make sure nothing else breaks in the process.
We'll need a holding place for end-of-run teardown. Like a teardown-file we will execute at the end of the run. In this file, we will simply export our commands to tear down the resources we create, as we create them. So if nothing is created, there is nothing to tear down. At the end, we just execute the teardown file and it does the trick.
Now we'll need a way to create assets both as a local resource and as a more visible one. In bulding a local resource, we might create-table, or create-view or create-synonym etc and then call a housekeeping functon with the database name and the resource name so that it's marked for teardown. this is as simple as an nzsql "drop table tablename" statement. Either way, we drop the resource at the end of the run.
We'll also need the more visible form, that is creating a local asset in our Transform database that is identical to the target version of the asset. CTAS magic works for us here, in that we offer up several parameters to a given function, such as the table-to-create name, the "real" table name and the target database. For example, if our transform database is a my_transforms, and the target database is my_target, with a table name of my_table, we would want to create a table thusly
create table my_table_temp as select * from my_target..my_table limit 0;
But we can see a limitation here. We cannot run an application with multiples of these without accidentally stomping on each other. If another thread of this ELT stream is launched (say this one is running behind) it will attempt to create my_table_temp as well, and will fail (but might think it succeeded, the table is there after all) and start to write its data into the same resource that is being used by another thread.
No, the most appropriate way to deal with this is the simplicity of the AUDIT_ID, a bigint value that we capture at the beginning of the framework's run (and yes, it's a simple sequence value we pull from a sequencer on the Transforms database).
this in hand, we now create with confidence:
create table my_table_temp_$AUDIT_ID as select * from my_target..my_table limit 0;
Now we have our very own copy and no other thread, even for debug, will gain access to it. What do we do after this, is we simply echo the drop command to the houskeeping file, and we're done:
echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file
At the end of the run, we'll execute each item in the housekeeping file and this will tear down all the assets we created.
Conversely, we could echo this to another houskeeping_file such as:
echo "drop table my_table_temp_$AUDIT_ID ;" >> $housekeeping_file_operator
And now we have a way to keep the tables around without losing track of them. The "drop" statment is logged to another external housekeeping file that we will not execute at the end of our current run. Rather, the operator can execute the file once a night or once a week, or whatever, and guarantees that the assets can be dropped without any surgical activity from the operator.
We'll do something else with this table name, though, how ahout a FINALIZE file that keeps track of each of these assets? After all, we pulled the definition from the Repository as a means to fill it with data that is destined for the same table in the Repository.
echo "insert into my_table select * from transforms..temp_my_table_$AUDIT_ID" >> $FINALIZE_FILE
echo "generate express statistics on my_table; " >> $FINALIZE_FILE
Now if we never make it to the end, we will toss the FINALIZE_FILE and do nothing. Otherwise we execute the FINALIZE file and commit the temporary tables to the target.
And the order of execution is, of course, FINALIZE first and HOUSEKEEPING second!
Another simple suggestion is that we make this call into a formal function, such as
MY_TABLE=$( create_table_from "my_table" "temp" "$AUDIT_ID" "my_target" )
now we have the variables aligned - the source table in my_target, the "temp" prefix and $AUDIT_ID suffix. We also have something else, as we can now reference our new table in simpler form, following:
insert into $MY_TABLE (
column1,
column 2
)
select * from my_source;
but what if the "my_source" here were created in yet another similar thread - itself creating a temporary asset:
insert into $MY_TABLE (
column1,
column 2
)
select * from $MY_SOURCE;
In this case, the value of "MY_SOURCE" is actually "staging_database..my_source". But we don't need to know that, do we? What if we loaded up another source table called "my_source_test" ? Could we now point the variable $MY_SOURCE to this new table and it remain transparent to the above ELT SQL statement? You see where this leads? Flexibility, portability. troubleshooting etc - all because we embrace a framework that is letting us out of the box.
We can build up resources and not worry about whether they exist locally, remotely, in staging or whatever. Another simple aspect of this approach is this - if we don't get rid if the resource(s) at the end of the run, or the run aborts prematurely for any particular reason, all of the resources we have already built - remain bult and filled - we don't need to start the ELT from the beginning.
Let's say we pull from ten different sources, integrate the data with a workload that takes about an hour (hard churning on billions of records) toward a final target of five reporting tables. If we get to the end and the last table has to abort for any number of reasons, we quit and don't commit. However, our threads are in a condition that allows us to restart from where we left off, and not repeat all that work again, with the added benefit that nothing has been committed to the repository - yet. So we don't lose all the processing time, we just pick up where we left off. Such a scenario requires a manual restart, but the primary takeaway is the ability to checkpoint our work de-facto without ever corrupting the final target. Netezza gives us the power to do these things inside the machine.
How, you might ask, do we determine where we left off so we can pick right back up? Hey, I'm out of space on this one, but later -

