When tables run to tens or even hundreds of millions of rows, removing duplicate entries is no longer a trivial effort. For smaller tables of say, less than 1 million rows, this works great –
SELECT * INTO #temp FROM [dbo].[goodSizedTable] TRUNCATE TABLE [dbo].[goodSizedTable] INSERT INTO [dbo].[goodSizedTable] SELECT DISTINCT * FROM #temp DROP TABLE #temp
This doesn’t scale too well with humongous tables and limited server RAM or undersized volumes for tempdb. If you’re unfortunate enough to have tempdb hosted on spinning disk (as opposed to SSD), you’ll really notice I/O problems (Buffer Waits).
When you start seeing timeout errors with SqlCmd.ExecuteNonQuery() you’ll need to set SqlCmd.CommandTimeout to something much larger than the default 30 seconds. This might ‘keep the wheels on’ but doesn’t feel like a solution.
Begin by counting your duplicates.
with myCTE AS( SELECT [Column1], ROW_NUMBER() OVER (PARTITION BY [Column1], [Column2], [Column3] ORDER BY [Column1]) AS [Row] FROM [dbo].[humongousTable] ) SELECT * FROM myCTE WHERE [Row] > 1
If the number is non-trivial then the pattern for your solution lies in there. Though not rigorously tested, this seems to run an order of magnitude faster.
DELETE [dups] FROM (SELECT ROW_NUMBER() OVER (PARTITION BY [Column1], [Column2], [Column3] ORDER BY [Column1]) AS [SeqID] FROM [dbo].[humongousTable]) AS [dups] WHERE [SeqID] > 1