8 Replies Last post: Feb 6, 2010 12:37 AM by Superuser  
rmenon   32 posts since
Apr 23, 2009
Currently Being Moderated

Feb 5, 2010 10:46 AM

most efficient way of deleting duplicates in netezza

Folks

For removing duplicate records from a table in Netezza, what is the most efficient way?

 

I am thinking something like:

 

 

delete from dupl

where rowid in

(

  select a.rid

    from

    (

      select rowid rid, row_number() over (partition by x,y,modification_ts order by rowid) rn

       from dupl

    ) a

   where rn != 1

);

Is this what people typically use or is there some other mechanism that is known to be more efficient?
Thanx!


Shawn Fox   384 posts since
Aug 15, 2006
Currently Being Moderated
1. Feb 5, 2010 10:56 AM in response to: rmenon
Re: most efficient way of deleting duplicates in netezza

The easist way:

 

create table nodups as select distinct * from table_with_dups;

 

This may not work on really huge tables, but I've done it on a 1.5TB table on a 10400.  It won't work if you have timestamp columns or something which might be different between two rows.

Shawn Fox   384 posts since
Aug 15, 2006
Currently Being Moderated
3. Feb 5, 2010 11:01 AM in response to: rmenon
Re: most efficient way of deleting duplicates in netezza

I meant it won't work if the timestamps are different or if you have any differences at all in the records.  Sometimes you might have duplicates based off your primary key, but different timestamps or some other column like that.

Jeff Feinsmith   8 posts since
Mar 28, 2007
Currently Being Moderated
4. Feb 5, 2010 1:15 PM in response to: Shawn Fox
Re: most efficient way of deleting duplicates in netezza

To avoid the problem that Shawn describes, and perhaps a slightly faster approach, try this:

 

create new_table as select pk1, pk2, pk3, max(othercol1), max(othercol2), max(othercol3) from old_table group by pk1, pk2, pk3;

 

This is basically does the same thing as the distinct approach, but you get to choose your key columns.  Everything else, including those pesky datetimes, gets rolled up into the max value.  It should also be a bit faster because you only need to perform distincting on a subset of the columns.

 

-Jeff

 

Message was edited by: Jeff Feinsmith

Mark Aukeman   21 posts since
Oct 14, 2008
Currently Being Moderated
7. Feb 5, 2010 12:47 PM in response to: Jeff Feinsmith
Re: most efficient way of deleting duplicates in netezza

This approach eliminates duplicates, but you need to be careful that the MAXed columns truly make business sense, like an arrival timestamp.  If you MAX a code column or foreign key that varies within your grouped primary key columns, then the results may be inaccurate.  The dupes are gone, but the problem is "swept under the carpet", so to speak.  Here is a suggestion to avoid inaccurate results:  1) Confirm the true primary key, 2) ensure that all non-additive columns exist in the Group By except the time stamp  3)  If there are still duplicates relative to the primary key, then get business criteria for which of the dupicates to preserve and which can be eliminated..

Superuser   82 posts since
Sep 19, 2008
Currently Being Moderated
8. Feb 6, 2010 12:58 AM in response to: Mark Aukeman
Re: most efficient way of deleting duplicates in netezza

oh, lot of approaches...

we use group by to achieve this

 

first check whether dups are there in the table according to the columns which are not supposed to have dups ( excluding ur inserted/updated timestamp columns or something which differentiate ur dups)

select key1,key2,key3 from old_table group by key1,key2,key3 having count(1)>1

if yes, then

create new_table as select * from old_table where 1=2

insert into new_table(key1,key2,key3) select key1,key2,key3 from old_table group by key1,key2,key3

More Like This

  • Retrieving data ...