De-duplicating Massive Tables

Challenge

When tables run to tens or even hundreds of millions of rows, removing duplicate entries is no longer a trivial effort. For smaller tables of say, less than 1 million rows, this works great –

SELECT * INTO #temp FROM [dbo].[goodSizedTable]
TRUNCATE TABLE [dbo].[goodSizedTable]
INSERT INTO [dbo].[goodSizedTable]
SELECT DISTINCT * FROM #temp
DROP TABLE #temp

This doesn’t scale too well with humongous tables and limited server RAM or undersized volumes for tempdb. If you’re unfortunate enough to have tempdb hosted on spinning disk (as opposed to SSD), you’ll really notice I/O problems (Buffer Waits).

When you start seeing timeout errors with SqlCmd.ExecuteNonQuery() you’ll need to set SqlCmd.CommandTimeout to something much larger than the default 30 seconds. This might ‘keep the wheels on’ but doesn’t feel like a solution.

Solution

Begin by counting your duplicates.

with myCTE AS(
SELECT [Column1], ROW_NUMBER() OVER (PARTITION BY [Column1], [Column2], [Column3] ORDER BY [Column1]) AS [Row]
FROM [dbo].[humongousTable]
)
SELECT * FROM myCTE WHERE [Row] > 1

If the number is non-trivial then the pattern for your solution lies in there. Though not rigorously tested, this seems to run an order of magnitude faster.

DELETE [dups]
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY [Column1], [Column2], [Column3] ORDER BY [Column1]) AS [SeqID] FROM [dbo].[humongousTable]) AS [dups]
WHERE [SeqID] > 1
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s