Everyday SQL - How to delete duplicated rows
Today we are going to delete the duplicated rows in the database.
We have a table to store the N-N relationship between order and Product, which contains three columns: Id
for PK, OrderId
and ProductId
. There is no unique constraint on column combination: [OrderId
, ProductId
]. The data is as follow:
WITH T(Id, OrderId, ProductId)
AS (SELECT 1,1,1 UNION ALL
SELECT 2,1,1 UNION ALL
SELECT 3,1,1 UNION ALL
SELECT 4,1,2)
SELECT * from T
If you are not familiar with the WITH
1 clause, it doesn’t matter. You can consider that there is a table T
contains the above data.
The data looks like:
Id | OrderId | ProductId |
---|---|---|
1 | 1 | 1 |
2 | 1 | 1 |
3 | 1 | 1 |
4 | 1 | 2 |
if we consider value pair (OrderId, ProductId) as unique, then Row 1, 2 and 3 are duplicated rows, we need to delete either two of them.
There are several functions for ranking in SQL Server: ROW_NUMBER, RANK, DENSE_RANK, NTILE. Let’s find out which one to use.
Which to use? ROW_NUMBER, RANK, or DENSE_RANK?
The syntaxes of these three are similiar:
ROW_NUMBER() | DENSE_RANK() | RANK()
OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause )
** The PARTITION BY
clause will decide what rows will be grouped into same partition, and the order_by_clause
will affect the number(row number or rank) for each row in the same partition. For the number, the ROW_NUMBER() always returns sequential numbers(for example 1, 2, 3, 4, 5), RANK() returns the same numeric value for ties(for example 1, 2, 2, 4, 5), while DENSE_RANK() returns numberic values without gaps(for example 1, 2, 2, 3, 4).
That’s the different between those three functions.
Use ROW_NUMBER() to identify duplication
In our case, [OrderId, ProductId] is used for duplication check, and we want to keep the id with max value in each partition. So our sql will look like:
WITH T(Id, OrderId, ProductId)
AS (SELECT 1,1,1 UNION ALL
SELECT 2,1,1 UNION ALL
SELECT 3,1,1 UNION ALL
SELECT 4,1,2)
SELECT *,
ROW_NUMBER() OVER(PARTITION BY OrderId, ProductId ORDER BY Id DESC) AS 'ROW_NUMBER'
FROM T
The result is:
Id | OrderId | ProductId | ROW_NUMBER |
---|---|---|---|
3 | 1 | 1 | 1 |
2 | 1 | 1 | 2 |
1 | 1 | 1 | 3 |
4 | 1 | 2 | 1 |
In the result we can see the first three rows are with row number 1, 2, 3, and Id with 3 has the lowest row number 1. Then we can delete all rows the row number of which is not 1.
Try RANK() and DENSE_RANK()
Use RANK
or DENSE_RANK
will return the same result, the reason is the Id column in ORDER BY
clause is unique, so the rank for each row is different. Let’s see the result:
WITH T(Id, OrderId, ProductId)
AS (SELECT 1,1,1 UNION ALL
SELECT 2,1,1 UNION ALL
SELECT 3,1,1 UNION ALL
SELECT 4,1,2)
SELECT *,
ROW_NUMBER() OVER(PARTITION BY OrderId, ProductId ORDER BY Id DESC) AS 'ROW_NUMBER',
RANK() OVER(PARTITION BY OrderId, ProductId ORDER BY Id DESC) AS 'RANK',
DENSE_RANK() OVER(PARTITION BY OrderId, ProductId ORDER BY Id DESC) AS 'DENSE_RANK'
FROM T
The result is the same as ROW_NUMBER:
Id | OrderId | ProductId | ROW_NUMBER | RANK | DENSE_RANK |
---|---|---|---|---|---|
3 | 1 | 1 | 1 | 1 | 1 |
2 | 1 | 1 | 2 | 2 | 2 |
1 | 1 | 1 | 3 | 3 | 3 |
4 | 1 | 2 | 1 | 1 | 1 |
If we change the ORDER BY
clause, the RANK
and DENSE_RANK
will return different results. Here we use Id % 2
in the ORDER BY
clause to return 0 for even Id and 1 for odd Id. See how it affects the rank.
WITH T(Id, OrderId, ProductId)
AS (SELECT 1,1,1 UNION ALL
SELECT 2,1,1 UNION ALL
SELECT 3,1,1 UNION ALL
SELECT 4,1,2)
SELECT *,
RANK() OVER(PARTITION BY OrderId, ProductId ORDER BY Id%2 DESC) AS 'RANK',
DENSE_RANK() OVER(PARTITION BY OrderId, ProductId ORDER BY Id%2 DESC) AS 'DENSE_RANK'
FROM T
Id | OrderId | ProductId | RANK | DENSE_RANK |
---|---|---|---|---|
3 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 1 |
2 | 1 | 1 | 3 | 2 |
4 | 1 | 2 | 1 | 1 |
-
To keep it simple, it’s not allowed to have more than 1
SELECT
clause followed. And it’s just a query, we will have it in each code snippet. ↩
blog comments powered by Disqus