Folks
For removing duplicate records from a table in Netezza, what is the most efficient way?
I am thinking something like:
delete from dupl
where rowid in
(
select a.rid
from
(
select rowid rid, row_number() over (partition by x,y,modification_ts order by rowid) rn
from dupl
) a
where rn != 1
);
The easist way:
create table nodups as select distinct * from table_with_dups;
This may not work on really huge tables, but I've done it on a 1.5TB table on a 10400. It won't work if you have timestamp columns or something which might be different between two rows.
Thanx - yeah a quick test reveals it to be quite a bit faster than the delete I proposed.
"It won't work if you have timestamp columns or something which might be different between two rows"
Did not quite understand the above statement => why wont it work when there are timestamps?
I meant it won't work if the timestamps are different or if you have any differences at all in the records. Sometimes you might have duplicates based off your primary key, but different timestamps or some other column like that.
To avoid the problem that Shawn describes, and perhaps a slightly faster approach, try this:
create new_table as select pk1, pk2, pk3, max(othercol1), max(othercol2), max(othercol3) from old_table group by pk1, pk2, pk3;
This is basically does the same thing as the distinct approach, but you get to choose your key columns. Everything else, including those pesky datetimes, gets rolled up into the max value. It should also be a bit faster because you only need to perform distincting on a subset of the columns.
-Jeff
Message was edited by: Jeff Feinsmith
"I meant it won't work if the timestamps are different or if you have any differences at all in the records. Sometimes you might have duplicates based off your primary key, but different timestamps or some other column like that."
I think you mean if the pk columns are the same and the time stamps (or some other column) are different somehow and if you want to consider such rows as duplicates, correct? If so, it makes sense to me.
Thanx Jeff,
Interesting. That approach makes sense if you want to choose your "unique key" columns.
This approach eliminates duplicates, but you need to be careful that the MAXed columns truly make business sense, like an arrival timestamp. If you MAX a code column or foreign key that varies within your grouped primary key columns, then the results may be inaccurate. The dupes are gone, but the problem is "swept under the carpet", so to speak. Here is a suggestion to avoid inaccurate results: 1) Confirm the true primary key, 2) ensure that all non-additive columns exist in the Group By except the time stamp 3) If there are still duplicates relative to the primary key, then get business criteria for which of the dupicates to preserve and which can be eliminated..
oh, lot of approaches...
we use group by to achieve this
first check whether dups are there in the table according to the columns which are not supposed to have dups ( excluding ur inserted/updated timestamp columns or something which differentiate ur dups)
select key1,key2,key3 from old_table group by key1,key2,key3 having count(1)>1
if yes, then
create new_table as select * from old_table where 1=2
insert into new_table(key1,key2,key3) select key1,key2,key3 from old_table group by key1,key2,key3