Create a sample table
We first create a sample table people and generate some data:
-- DROP TABLE IF EXISTS people;
CREATE TABLE people (
  id    integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name  varchar(50)  NOT NULL,
  email varchar(100) NOT NULL
);

INSERT INTO people(name, email)
VALUES ('Zhang San', 'zhangsan@'),
       ('Li Si',     'lisi@'),
       ('Wang Wu',   'wangwu@'),
       ('Lis',       'lisi@'),
       ('Wang Wu',   'wangwu@'),
       ('Wang Wu',   'wangwu@');

SELECT * FROM people;

id|name     |email    |
--|---------|---------|
 1|Zhang San|zhangsan@|
 2|Li Si    |lisi@    |
 3|Wang Wu  |wangwu@  |
 4|Lis      |lisi@    |
 5|Wang Wu  |wangwu@  |
 6|Wang Wu  |wangwu@  |
Here, rows 2 and 4 have duplicate values in the email field, and rows 3, 5, and 6 have duplicate values in both the name and email fields.
At this point, if we want to create a unique constraint for the email field, an error will be returned:
ALTER TABLE people ADD CONSTRAINT uk_people_email UNIQUE (email);

SQL Error [23505]: ERROR: could not create unique index "uk_people_email"
  Detail: Key (email)=(lisi@) is duplicated.
We must find and delete the duplicate records in the email field before we can create the unique constraint.
Find duplicate records using aggregate functions
Duplicate records based on a single field
If you want to find duplicate email values, you can group by that field and return the groups with more than one row:
SELECT email, count(email)
FROM people
GROUP BY email
HAVING count(email) > 1;

email  |count|
-------|-----|
wangwu@|    3|
lisi@  |    2|
The query results show that two email addresses have duplicates. If you want to see the complete duplicate rows, you can use a subquery or a join query:
SELECT *
FROM people
WHERE email IN (
  SELECT email
  FROM people
  GROUP BY email
  HAVING count(email) > 1)
ORDER BY email;

id|name   |email  |
--|-------|-------|
 2|Li Si  |lisi@  |
 4|Lis    |lisi@  |
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|

WITH d AS (
  SELECT email
  FROM people
  GROUP BY email
  HAVING count(email) > 1)
SELECT p.*
FROM people p
JOIN d ON (p.email = d.email)
ORDER BY p.email;

id|name   |email  |
--|-------|-------|
 2|Li Si  |lisi@  |
 4|Lis    |lisi@  |
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|
Another way to find duplicate records is to use a self-join together with the DISTINCT operator, for example:
SELECT DISTINCT p.*
FROM people p
JOIN people d ON p.email = d.email
WHERE p.id <> d.id
ORDER BY p.email;

id|name   |email  |
--|-------|-------|
 2|Li Si  |lisi@  |
 4|Lis    |lisi@  |
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|
Note that DISTINCT cannot be omitted; otherwise some rows (3, 5, and 6) would be returned multiple times.
Duplicate records based on multiple fields
If we want to find rows in which both the name and email fields are duplicated, we can group by these two fields:
SELECT *
FROM people
WHERE (name, email) IN (
  SELECT name, email
  FROM people
  GROUP BY name, email
  HAVING count(1) > 1)
ORDER BY email;

id|name   |email  |
--|-------|-------|
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|

WITH d AS (
  SELECT name, email
  FROM people
  GROUP BY name, email
  HAVING count(*) > 1)
SELECT p.*
FROM people p
JOIN d ON (p.name = d.name AND p.email = d.email)
ORDER BY p.id;

id|name   |email  |
--|-------|-------|
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|
Rows count as duplicates only when both name and email are the same, so rows 2 and 4, whose names differ, are not duplicate records here.
Find duplicate records using window functions
Finding duplicate records with an aggregate function requires scanning the same table twice; if the table contains a lot of data, this may cause performance problems. Instead, we can use another approach: window functions.
Duplicate records based on a single field
First, we use the count() window function to count how many times each email appears:
SELECT id, name, email,
       count(*) OVER (PARTITION BY email) AS cnt
FROM people;

id|name     |email    |cnt|
--|---------|---------|---|
 2|Li Si    |lisi@    |  2|
 4|Lis      |lisi@    |  2|
 3|Wang Wu  |wangwu@  |  3|
 5|Wang Wu  |wangwu@  |  3|
 6|Wang Wu  |wangwu@  |  3|
 1|Zhang San|zhangsan@|  1|
The window function not only counts the occurrences of each value but also retains the original rows, avoiding a second scan. Next, we only need to return the records whose count is greater than 1:
WITH d AS (
  SELECT id, name, email,
         count(*) OVER (PARTITION BY email) AS cnt
  FROM people)
SELECT *
FROM d
WHERE cnt > 1
ORDER BY id;

id|name   |email  |cnt|
--|-------|-------|---|
 2|Li Si  |lisi@  |  2|
 3|Wang Wu|wangwu@|  3|
 4|Lis    |lisi@  |  2|
 5|Wang Wu|wangwu@|  3|
 6|Wang Wu|wangwu@|  3|
Duplicate records based on multiple fields
Window functions also support partitioning by multiple fields. The following statement finds rows in which both the name and email fields are duplicated:
WITH d AS (
  SELECT id, name, email,
         count(*) OVER (PARTITION BY name, email) AS cnt
  FROM people)
SELECT *
FROM d
WHERE cnt > 1
ORDER BY id;

id|name   |email  |cnt|
--|-------|-------|---|
 3|Wang Wu|wangwu@|  3|
 5|Wang Wu|wangwu@|  3|
 6|Wang Wu|wangwu@|  3|
Clearly, the window-function approach not only performs better but is also more readable.
Delete duplicate records
The next problem to solve is how to delete these duplicate records. Usually we need to keep one row from each group of duplicates.
Use a subquery to delete duplicate records
If we want to delete the duplicate records in the email field and keep only the row with the smallest id in each group, we can use a subquery to find the rows to retain and then delete all the others:
DELETE FROM people
WHERE id NOT IN (
  SELECT min(id)
  FROM people
  GROUP BY email
);
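As a PostgreSQL-specific alternative not shown in the original text, the DISTINCT ON clause can pick one representative row per email in a single pass; sorting by email, id makes that representative the row with the smallest id. A sketch:

```sql
-- Keep the first row per email (smallest id, thanks to ORDER BY email, id),
-- delete everything else.
DELETE FROM people
WHERE id NOT IN (
  SELECT DISTINCT ON (email) id
  FROM people
  ORDER BY email, id
);
```

DISTINCT ON is not standard SQL, so this variant is not portable to other databases.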
After deletion, query the people table again:
SELECT * FROM people;

id|name     |email    |
--|---------|---------|
 1|Zhang San|zhangsan@|
 2|Li Si    |lisi@    |
 3|Wang Wu  |wangwu@  |
You can also use a cross-table delete (DELETE ... USING) or a correlated subquery to delete the duplicate data:
DELETE FROM people p
USING people d
WHERE p.email = d.email
  AND p.id > d.id;

DELETE FROM people p
WHERE p.id NOT IN (
  SELECT min(id)
  FROM people
  WHERE email = p.email);
Before executing the above statement, remember to recreate the people table and generate the test data.
Think about it: how would you keep the row with the largest id in each group of duplicates instead?
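One possible answer, sketched here by swapping the aggregate in the earlier subquery (this assumes the people table has been recreated with the original test data):

```sql
-- Keep the largest id per email instead of the smallest:
DELETE FROM people
WHERE id NOT IN (
  SELECT max(id)   -- min(id) kept the smallest; max(id) keeps the largest
  FROM people
  GROUP BY email
);
```

The same swap works for the correlated-subquery variant: replace min(id) with max(id).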
Use a window function to delete duplicate records
The ROW_NUMBER() window function can be used to group the data and then assign a unique sequential number to each row within its group. For example:
SELECT id, name, email,
       row_number() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM people;

id|name     |email    |row_num|
--|---------|---------|-------|
 2|Li Si    |lisi@    |      1|
 4|Lis      |lisi@    |      2|
 3|Wang Wu  |wangwu@  |      1|
 5|Wang Wu  |wangwu@  |      2|
 6|Wang Wu  |wangwu@  |      3|
 1|Zhang San|zhangsan@|      1|
The statement groups the data by email (PARTITION BY email), sorts each group by id (ORDER BY id), and then numbers the rows within each group; a number greater than 1 indicates a duplicate row.
In addition to ROW_NUMBER(), the RANK() or DENSE_RANK() functions can achieve the same effect here. For an introduction to window functions and more use cases, you can refer to this article.
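A sketch comparing the three ranking functions on the sample table (a named WINDOW clause avoids repeating the window definition):

```sql
SELECT id, email,
       row_number() OVER w AS row_num,
       rank()       OVER w AS rnk,
       dense_rank() OVER w AS dense_rnk
FROM people
WINDOW w AS (PARTITION BY email ORDER BY id);
```

Because id is unique within each partition, there are no ties and all three functions return the same numbers here; they differ only when the ORDER BY column contains duplicates.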
We can delete the duplicate records based on this query result (note that ORDER BY id DESC below keeps the row with the largest id in each group):
DELETE FROM people
WHERE id IN (
  SELECT id
  FROM (
    SELECT id, name, email,
           row_number() OVER (PARTITION BY email ORDER BY id DESC) AS row_num
    FROM people) d
  WHERE row_num > 1);
Before executing the above statement, remember to recreate the people table and generate the test data.
Deduplication based on multiple fields works the same way as the single-field case; you can try it yourself. Finally, to prevent duplicate data from being generated again, a unique constraint can be added. For example:
ALTER TABLE people ADD CONSTRAINT uk_people_email UNIQUE (email);
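The multi-field deduplication suggested above can be sketched like this (assuming the people table has been recreated with the original test data; this keeps the smallest id per name/email pair):

```sql
-- Partition by both columns so only rows duplicated in
-- name AND email receive row_num > 1.
DELETE FROM people
WHERE id IN (
  SELECT id
  FROM (
    SELECT id,
           row_number() OVER (PARTITION BY name, email ORDER BY id) AS row_num
    FROM people) d
  WHERE row_num > 1);
```

A matching multi-column constraint would then be `UNIQUE (name, email)` rather than `UNIQUE (email)`.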
Summary
This article introduced how to find and delete duplicate records in PostgreSQL: using aggregate functions together with subqueries or join queries, and using window functions, to find duplicate data in a single field or across multiple fields; and using subqueries and window functions to delete the duplicates.