Skip navigation

Gather 'round the Grill

1 Post tagged with the external tag
0

Honor the Host

Posted by David Birmingham May 26, 2009

Some enterprises will stand up a Netezza machine and point all their data processing towards it. They wouldn't think of actually installing anything on the Netezza machine (such as database clients or other client software) and of course, are strongly advised against by the vendor. Why is this? The Netezza host has a lot of work to do in keeping those spinning SPUs happy and busy. Adding other duties can detract from this critical mission, and we don't want that.

 

But we can also abuse the host in subtle ways. A case in point follows - you may have other tales to tell.

 

We always have a need to pull in a wide variety of files. In this particular case, dozens of intake tables in their various staging locations. In many installations, the intake table definitions are few, discrete and stable. But in just as many, the staging tables will mirror the upstream sources, with one table for each upstream interface. In our case, handling source-to-target with no ETL in between. We extract directly from the source into an intake table definition that mimics source column names, but the data types are all varchar to facilitate "dirty" intake. The objective is to get the data into the machine.

 

Then we convert this intake table to its final form, the internal Netezza table that is identical to the source table in column name and type. This conversion is a simple table copy, mechanically speaking, but we have to do some light ELT to make it happen. For example, we need to guard against nulls, empty strings, bogus numeric values and the like. In our case, numerics could be dozens of characters in width because the upstream definition happened to be a view with no defined precision. A typical intake SQL could look like:

 

select

case when column is null then value else column end,
case when translate('-+.0123456789','') = '' then column else null end,

etc

 

Such that each column is wrapped with this kind of logic (call it "Intake ELT"). Now, we don't manually wrap these column defs, we do it dynamically from the Netezza catalog definition. (And for efficiency, we cache it for later reuse, but that's another story).

 

Now we have an intake-ELT that looks thus:


External Database Table -> network -> intake table ->  Intake ELT -> Staging Table

 

Note for clarity - the External Database Table and Staging Table are "book ends" to this operation, and have the same column names, data types and column order. We don't absolutely require common column ordering, but it's handy for troubleshooting.

 

Note also that this works just as well for flat file intake as database intake. Better, in fact, because we can more easily load multiple files at once than multiple tables at once (the database might not like multiple extracts)


All of this worked swimmingly until we encountered a slightly different kind of data feed, one that had to be extracted from an archival source into flat files. Rather than present the flat file as normal (on the network) the admins decided to use the available on-board Netezza storage pad (5 TB of space). Keep in mind that we were not allowed to execute anything directly on the machine, so we had to set up External Tables on top of these files to load them, rather than using NZLOAD. This, too, worked transparently and all was well. Then a "bright idea" occurred, that in the above equation the Intake ELT faced a table (our intake table) and couldn't we just use the intake ELT right on top of the External Table, eliminating the additional middle-man?

 

Like so:


flat File -> External Table -> Intake ELT -> Staging Table


The above configuration only appears more efficient by eliminating the Intake table. Looks are quite deceiving, cconsidering how much "per-column work" the Intake ELT had to perform to get data into the Staging Table. What is not obvious, is that the Intake ELT is now sitting on top of the External Table, which is a Host-managed table, not a SPU-managed table. In this configuration, we have reduced our power from a 108-SPU problem to a 4-(Host) CPU problem. The immediate loss of power was measurable in orders of magnitude.

 

So under the covers, here's the power-plant difference in the two models:

 

External Database Table -> network -> intake table ->  Intake ELT -> Staging Table
                          |----HOST -------------|---SPUs--------------------------------|

 

flat File -> External Table -> Intake ELT -> Staging Table
             |------HOST ------------------------------|


So we can see that the second model is abusing the host with the Intake ELT, and if we go with the original model, the ELT will be handled by the SPUs, offering the necessary scalability and power. In a continuum, we can see where we might initially install nzload or external tables and perhaps "tweak" them along the way. Then a maintenance developer comes along and sees that the "easiest" place to add a fix is in the external table or the nzload rather than pushing it to SPUs. The external table and nzload can (and should) do light-intake formatting per their interface specifications, but no further.

 

The over-arching directive remains the same - get the data into the SPU-based tables as rapidly as possible and then do the "dirty-work" with massively parallel power.