Skip navigation
2630 Views 4 Replies Latest reply: Feb 23, 2012 10:31 AM by nehal.sheth@twcable.com RSS
Currently Being Moderated

Feb 22, 2012 12:17 PM

Does distribute on clause makes any difference while accessing the rows

Hi All

 

I am from teradata background and working with Netezza nowadays.I have below query regarding distribution on clause :

 

As in teradata primary index is the most efficient path to reach to a row. Does distribution on clause is also play role while fetching rows ?

 

I tried this with one table but even though i use distribution column in where or not , it shows me sequential scan only in explain plan !!!!!!

 

Do we have anything like secondary indexes in netezza ?

 

What all factors we need to consider while deciding distribute on clause ?? like in teradata we consider below points:

 

Join access frequency

Value access frequency

Value access rows

Distinct rows

Max rows per value

Max rows null

Typical rows per value

Change rating.

 

Thanks in advance

Nehal

  • nickg Active Enzee 242 posts since
    Mar 16, 2011

    The DISTRIBUTE ON clause is identical to a Teradata Primary Index definition.  They both are used to determine which SPU (netezza) or AMP (teradata) holds the data.  Neither are actually indexes in the traditional sense of the word.

     

    Netezza does have any indexes.  It does have zone maps, which help reduce the number of disk blocks read, but they are not indexes.

     

    There are two things to consider when specifying distribution (RANDOM is the default distribution):

    1. Skew - how even (numbers of rows) is the distribution across spus.  Ideally, each spu should hold the same number of rows as other spus.  Differences of 10% between the smallest and highest population are often considered worst case and should not be exceeded.

    2. Co-located joins - if two tables are commonly joined and are large (depends on system, but usually 10GB or more) join query performance can be significantly improved by distributing both tables on the join columns so that a SPU broadcast is not necessary to join the two tables.  ALL distribute on columns must be used in the join, and the join must be an equi-join for a co-located join to occur, so usually the fewer distribute on columns specified the better, provided it results in a reasonably smooth distribution.  If the distribution choice causes a highly skewed distribution, it is often better off to leave the distribution random.

     

    In the case of smaller tables, particularly in dimensional models, random distibution usually works fine.  When implementing dimensional models (which work very well in Netezza, whereas they don't work well at all on Teradata) it is common to distribute the fact table randomly and the dimensions on their surrogate PK.  When you have a very large dimension, you may sometimes distribute the facts on that dimension's FK (or one of the FK's if there are multiple role based references to that dimension).  Again, it is important to be aware of data skew when doing so.

  • nickg Active Enzee 242 posts since
    Mar 16, 2011

    On Netezza it doesn't matter.  A query is always executed syncronously across all spus.

     

    Since spus operate syncronously, skew becomes important.  If one spu holds twice as much data as the other spus, it will take longest to complete its share of the work.  The total run time is tied to the longest running spu.

More Like This

  • Retrieving data ...

Bookmarked By (0)

Legend

  • Correct Answers - 4 points
  • Helpful Answers - 2 points