I was just recently working with a client that was moving a few 100k rows of data on a daily basis from one system to another and all this data was incremental in the sense that the source data was a ticket system and there was an incrementing ID on all the sales records. Now the problem was that for various reasons not all data had been transferred in the past, a few sales records had been deleted because of void transactions (these were not so many), others were simply missing (this was a substantial bunch). All these missing data were amounting to quite a bit of money in the sales reports so I had to find a way to identify and transfer them on a regular basis so that this problem was kept at a bare minimum.
The tables that needed to be synced had in excess of 6 mill rows and even though it’s not huge it’s more than enough to cause problems if you write some bad code. SO! With some excellent help from all my geek friends over at http://www.sqlteam.com/forums I was able to come down with a script that did the trick…and mind you; it took less than 30 seconds to execute on the production box identifying over 500 gaps! Now since we are only talking 100k rows here the query will probably have sub-second execution time but it scales very well (the concept behind the query originally created by MVP Steve Kass, http://www.stevekass.com/)
-- Make sure insert runs a little faster
SET NOCOUNT ON
GO
-- Prepare a sample table
CREATE TABLE GapsTest (
ID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
Name varchar(50) NULL
)
GO
-- Insert 100k rows of dummy data
INSERT INTO GapsTest (Name) VALUES ('TheFirstSQL')
GO 100000
-- Delete some rows to create gaps in the identity column
DELETE GapsTest WHERE ID IN (263, 264, 37392, 44, 98712, 66219)
DELETE GapsTest WHERE ID BETWEEN 35000 AND 36000
GO
-- Run the "magic" query
;WITH
cte AS (
SELECT
ID,
RowNum = ROW_NUMBER() OVER (ORDER BY ID)
FROM dbo.GapsTest),
cte2 AS (
SELECT *, DENSE_RANK() OVER (ORDER BY ID - RowNum) As Series
FROM cte),
cte3 AS (
SELECT *, COUNT(*) OVER (PARTITION BY Series) AS SCount
FROM cte2),
cte4 AS (
SELECT
MinID = MIN(ID),
MaxID = MAX(ID),
Series
FROM cte3
GROUP BY Series)
SELECT GapStart = a.MaxID, GapEnd = b.MinID
FROM cte4 a
INNER JOIN cte4 b
ON a.Series+1 = b.Series
ORDER BY GapStart
Thanks for this interesting post. It was informative. Please check the alternate methods I have blogged relating to this @ http://vadivel.blogspot.com/2011/10/how-to-find-missing-gaps-sequence.html
;WITH cteSource(ID, Lead)
AS (
SELECT ID,
LEAD(ID) OVER (ORDER BY ID) AS Lead
FROM dbo.GapsTest
)
SELECT ID,
Lead
FROM cteSource
WHERE ID + 1 < Lead
WHERE ID < Lead – 1
2012! Haven’t got my hands on it yet unfortunately…