Skip navigation
18601 Views 8 Replies Latest reply: Feb 6, 2010 12:58 AM by Superuser RSS
rmenon New Enzee 59 posts since
Apr 23, 2009
Currently Being Moderated

Feb 5, 2010 10:46 AM

most efficient way of deleting duplicates in netezza

Folks

For removing duplicate records from a table in Netezza, what is the most efficient way?

 

I am thinking something like:

 

 

delete from dupl

where rowid in

(

  select a.rid

    from

    (

      select rowid rid, row_number() over (partition by x,y,modification_ts order by rowid) rn

       from dupl

    ) a

   where rn != 1

);

Is this what people typically use or is there some other mechanism that is known to be more efficient?
Thanx!


  • Shawn Fox Enzee Exraordinaire 1,326 posts since
    Aug 15, 2006
    Currently Being Moderated
    1. Feb 5, 2010 10:56 AM (in response to rmenon)
    Re: most efficient way of deleting duplicates in netezza

    The easist way:

     

    create table nodups as select distinct * from table_with_dups;

    This may not work on really huge tables, but I've done it on a 1.5TB table on a 10400.  It won't work if you have timestamp columns or something which might be different between two rows.

  • Shawn Fox Enzee Exraordinaire 1,326 posts since
    Aug 15, 2006
    Currently Being Moderated
    3. Feb 5, 2010 11:01 AM (in response to rmenon)
    Re: most efficient way of deleting duplicates in netezza

    I meant it won't work if the timestamps are different or if you have any differences at all in the records.  Sometimes you might have duplicates based off your primary key, but different timestamps or some other column like that.

  • Jeff Feinsmith New Enzee 13 posts since
    Mar 28, 2007

    To avoid the problem that Shawn describes, and perhaps a slightly faster approach, try this:

     

    create new_table as select pk1, pk2, pk3, max(othercol1), max(othercol2), max(othercol3) from old_table group by pk1, pk2, pk3;

     

    This is basically does the same thing as the distinct approach, but you get to choose your key columns.  Everything else, including those pesky datetimes, gets rolled up into the max value.  It should also be a bit faster because you only need to perform distincting on a subset of the columns.

     

    -Jeff

     

    Message was edited by: Jeff Feinsmith

  • Mark Aukeman New Enzee 26 posts since
    Oct 14, 2008

    This approach eliminates duplicates, but you need to be careful that the MAXed columns truly make business sense, like an arrival timestamp.  If you MAX a code column or foreign key that varies within your grouped primary key columns, then the results may be inaccurate.  The dupes are gone, but the problem is "swept under the carpet", so to speak.  Here is a suggestion to avoid inaccurate results:  1) Confirm the true primary key, 2) ensure that all non-additive columns exist in the Group By except the time stamp  3)  If there are still duplicates relative to the primary key, then get business criteria for which of the dupicates to preserve and which can be eliminated..

  • Superuser Rookie 90 posts since
    Sep 19, 2008

    oh, lot of approaches...

    we use group by to achieve this

     

    first check whether dups are there in the table according to the columns which are not supposed to have dups ( excluding ur inserted/updated timestamp columns or something which differentiate ur dups)

    select key1,key2,key3 from old_table group by key1,key2,key3 having count(1)>1

    if yes, then

    create new_table as select * from old_table where 1=2

    insert into new_table(key1,key2,key3) select key1,key2,key3 from old_table group by key1,key2,key3

More Like This

  • Retrieving data ...

Bookmarked By (0)