Hi,
Am new to Netezza, can anyone tell me the concept of skew in the Netezza Aritecture. How does that effect the distribution key when creating a table.
I see in the NPS Administrator tool the skew values for different table, its high for some table and low for other.
Would be a great help!!!!!!
Thanks,
Singh
Skew is simply a way to measure the efficiency of the data distribution. Anything below 0.0 is probably what you are shooting for,
As an example, if you distribute randomly, you will often get a skew of 0, meaning that there is no statistical (and perhaps physical) difference in the total rows from SPU to SPU.
However, when using a distribution key, it is expected that some keys may have more records associated with them than others, but on average will usually balance out to an even distribution. This is provided that the given key has a fairly high cardinality compared to the data. If, for example, you chose a boolean value as a key, the data would end up on just two SPUs, meaning that it's not really distributed at all. If even one of the SPUs is "dogpiled" where an inordinate count of records is on it for that table, it will slow down the query. It basically means that the query will only be as fast as the slowest SPU, and if one SPU is overloaded it won't matter how fast the other SPUs did their work.
Another form of skew is "process skew". Let's say you distribute your data on a transaction date, and when it's all done the records are physically distributed as evenly as you would expect. But when you do a query for a given date, the query is very slow. This is because only one SPU (the one containing the data for the date-in-question) is actually doing any work. Dates are notoriously bad choices for a distribution key.
Get a copy of the book "Netezza Underground" from Amazon.com - lots more detail on applications and pitfalls.
Hi David Birmingham,
Thank you for your answer and for your time on this, I am clear about this now.
I have many questions on Netezza, To start with, Can we also install Netezza on Windows, if yes what type of query tools we use and any specific ETL tools to use with.
Coz I am not used to UNIX scripts to that extent, So just curious to know about it.
Thanks,
Singh.
You can also take a look at the Aginity Netezza Workbench which is both a simple query browser and includes some of the administrative capabilities such as distribution skew, etc.
