Dealing with duplicate data in a table

At times I come across a situation where there's duplicate data in a table.  There can be many reasons for this, but unless it's intentional, duplicate data is a problem.

In this situation, the duplicates were caused (we believe) by an issue with GoldenGate replication, in which data may have been loaded twice.  Because this table doesn't have a unique index, there was no mechanism to prevent this.

Deleting duplicates can be tedious, but it can also be quite quick (depending on your table size).  The idea behind this approach is that duplicates are identified and then ranked (first duplicate, second duplicate and so on), and that ranking determines what gets deleted.  By deleting all rows with a ranking of 2 or more, we remove every duplicate while keeping one copy of each row.

For this to work, every column in the table must be listed in the PARTITION BY clause; otherwise, rows that differ only in a column you left out would be treated as duplicates.
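
Before running the delete, it can be useful to see how many rows the ranking actually flags.  Here's a rough sketch of that check, reusing the same column list as the delete below (adjust it to match your own table):

-- Count the rows that would be removed: every row ranked 2 or higher
-- within its group of identical column values is a duplicate.
SELECT COUNT(*) AS duplicate_rows
FROM
  (SELECT dense_rank() OVER (PARTITION BY ORDER_ID, LINE_ITEM_ID, PRODUCT_ID, UNIT_PRICE, QUANTITY,
                                          DISPATCH_DATE, RETURN_DATE, GIFT_WRAP, CONDITION, SUPPLIER_ID, ESTIMATED_DELIVERY
                             ORDER BY rowid) rownumber
   FROM SOE.ORDER_ITEMS)
WHERE rownumber > 1;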

Here’s the code.

DELETE FROM SOE.ORDER_ITEMS
WHERE rowid IN
  (SELECT rid
   FROM
     (SELECT rowid rid,
             dense_rank() OVER (PARTITION BY ORDER_ID, LINE_ITEM_ID, PRODUCT_ID, UNIT_PRICE, QUANTITY,
                                             DISPATCH_DATE, RETURN_DATE, GIFT_WRAP, CONDITION, SUPPLIER_ID, ESTIMATED_DELIVERY
                                ORDER BY rowid) rownumber
      FROM SOE.ORDER_ITEMS
     )
   WHERE rownumber > 1
  );
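
Before committing, a quick sanity check (again, a sketch against the same table and column list) is to group by every column and confirm that no duplicate groups remain; this should return no rows once the delete has done its job:

-- Any rows returned here are column-for-column duplicates that still exist.
SELECT ORDER_ID, LINE_ITEM_ID, PRODUCT_ID, UNIT_PRICE, QUANTITY,
       DISPATCH_DATE, RETURN_DATE, GIFT_WRAP, CONDITION, SUPPLIER_ID, ESTIMATED_DELIVERY,
       COUNT(*) AS copies
FROM SOE.ORDER_ITEMS
GROUP BY ORDER_ID, LINE_ITEM_ID, PRODUCT_ID, UNIT_PRICE, QUANTITY,
         DISPATCH_DATE, RETURN_DATE, GIFT_WRAP, CONDITION, SUPPLIER_ID, ESTIMATED_DELIVERY
HAVING COUNT(*) > 1;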

If it all looks good, issue a commit and go back to what you were doing.
