Feb 22, 2012 12:17 PM
Does distribute on clause makes any difference while accessing the rows
-
Like (0)
Hi All
I am from teradata background and working with Netezza nowadays.I have below query regarding distribution on clause :
As in teradata primary index is the most efficient path to reach to a row. Does distribution on clause is also play role while fetching rows ?
I tried this with one table but even though i use distribution column in where or not , it shows me sequential scan only in explain plan !!!!!!
Do we have anything like secondary indexes in netezza ?
What all factors we need to consider while deciding distribute on clause ?? like in teradata we consider below points:
Join access frequency
Value access frequency
Value access rows
Distinct rows
Max rows per value
Max rows null
Typical rows per value
Change rating.
Thanks in advance
Nehal
The DISTRIBUTE ON clause is identical to a Teradata Primary Index definition. They both are used to determine which SPU (netezza) or AMP (teradata) holds the data. Neither are actually indexes in the traditional sense of the word.
Netezza does have any indexes. It does have zone maps, which help reduce the number of disk blocks read, but they are not indexes.
There are two things to consider when specifying distribution (RANDOM is the default distribution):
1. Skew - how even (numbers of rows) is the distribution across spus. Ideally, each spu should hold the same number of rows as other spus. Differences of 10% between the smallest and highest population are often considered worst case and should not be exceeded.
2. Co-located joins - if two tables are commonly joined and are large (depends on system, but usually 10GB or more) join query performance can be significantly improved by distributing both tables on the join columns so that a SPU broadcast is not necessary to join the two tables. ALL distribute on columns must be used in the join, and the join must be an equi-join for a co-located join to occur, so usually the fewer distribute on columns specified the better, provided it results in a reasonably smooth distribution. If the distribution choice causes a highly skewed distribution, it is often better off to leave the distribution random.
In the case of smaller tables, particularly in dimensional models, random distibution usually works fine. When implementing dimensional models (which work very well in Netezza, whereas they don't work well at all on Teradata) it is common to distribute the fact table randomly and the dimensions on their surrogate PK. When you have a very large dimension, you may sometimes distribute the facts on that dimension's FK (or one of the FK's if there are multiple role based references to that dimension). Again, it is important to be aware of data skew when doing so.
Thanks Nick for your reply.
I agree with you that primary index in teradata determines the data distribution but if we use primary index columns in the where clause of a query, teradata will calculate the hash value and will directly jump to the particualr AMP using hashmap. In this case it will not do full table scan.
So my question is distribution on also does the same or Netezza does full table scan/Sequential scan always ?
Thanks
Nehal
On Netezza it doesn't matter. A query is always executed syncronously across all spus.
Since spus operate syncronously, skew becomes important. If one spu holds twice as much data as the other spus, it will take longest to complete its share of the work. The total run time is tied to the longest running spu.
Thanks Nick for your reply.

