Create a sample table
We first create a sample table people and generate some data:
-- DROP TABLE IF EXISTS people;
CREATE TABLE people (
  id    integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name  varchar(50)  NOT NULL,
  email varchar(100) NOT NULL
);

INSERT INTO people(name, email)
VALUES ('Zhang San', 'zhangsan@'),
       ('Li Si',     'lisi@'),
       ('Wang Wu',   'wangwu@'),
       ('Lis',       'lisi@'),
       ('Wang Wu',   'wangwu@'),
       ('Wang Wu',   'wangwu@');

SELECT * FROM people;

id|name     |email    |
--|---------|---------|
 1|Zhang San|zhangsan@|
 2|Li Si    |lisi@    |
 3|Wang Wu  |wangwu@  |
 4|Lis      |lisi@    |
 5|Wang Wu  |wangwu@  |
 6|Wang Wu  |wangwu@  |
Here, rows 2 and 4 have duplicate values in the email field, and rows 3, 5, and 6 have duplicate values in both the name and email fields.
At this point, if we want to create a unique constraint for the email field, an error will be returned:
ALTER TABLE people ADD CONSTRAINT uk_people_email UNIQUE (email);

SQL Error [23505]: ERROR: could not create unique index "uk_people_email"
  Detail: Key (email)=(lisi@) is duplicated.
We must find and delete the duplicate records in the email field before we can create the unique constraint.
Find duplicate records using aggregate functions
Duplicate records based on a single field
If you want to find duplicate email values, you can group by that field and return the groups with more than one row:
SELECT email, count(email)
FROM people
GROUP BY email
HAVING count(email) > 1;

email  |count|
-------|-----|
wangwu@|    3|
lisi@  |    2|
The query results show that two email addresses have duplicates. If you want to see the complete duplicate rows, you can use a subquery or a join query:
SELECT *
FROM people
WHERE email IN (
  SELECT email
  FROM people
  GROUP BY email
  HAVING count(email) > 1)
ORDER BY email;

id|name   |email  |
--|-------|-------|
 2|Li Si  |lisi@  |
 4|Lis    |lisi@  |
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|

WITH d AS (
  SELECT email
  FROM people
  GROUP BY email
  HAVING count(email) > 1)
SELECT p.*
FROM people p
JOIN d ON (p.email = d.email)
ORDER BY p.email;

id|name   |email  |
--|-------|-------|
 2|Li Si  |lisi@  |
 4|Lis    |lisi@  |
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|
Another way to find duplicate records is to use a self-join together with the DISTINCT operator, for example:
SELECT DISTINCT p.*
FROM people p
JOIN people d ON p.email = d.email
WHERE p.id <> d.id
ORDER BY p.email;

id|name   |email  |
--|-------|-------|
 2|Li Si  |lisi@  |
 4|Lis    |lisi@  |
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|
Note that DISTINCT cannot be omitted; otherwise some rows (3, 5, and 6) would be returned multiple times.
Duplicate records based on multiple fields
If we want to find rows in which both the name and email fields are duplicated, we can group by these two fields:
SELECT *
FROM people
WHERE (name, email) IN (
  SELECT name, email
  FROM people
  GROUP BY name, email
  HAVING count(1) > 1)
ORDER BY email;

id|name   |email  |
--|-------|-------|
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|

WITH d AS (
  SELECT name, email
  FROM people
  GROUP BY name, email
  HAVING count(*) > 1)
SELECT p.*
FROM people p
JOIN d ON (p.name = d.name AND p.email = d.email)
ORDER BY p.id;

id|name   |email  |
--|-------|-------|
 3|Wang Wu|wangwu@|
 5|Wang Wu|wangwu@|
 6|Wang Wu|wangwu@|
Rows count as duplicates only when both name and email are the same, so rows 2 and 4, whose names differ, are not duplicate records here.
Find duplicate records using window functions
Finding duplicate records with an aggregate function requires scanning the same table twice; if the table contains a lot of data, this may cause performance problems. Instead, we can use another approach: window functions.
Duplicate records based on a single field
First, we use the count() window function to count how many times each email appears:
SELECT id, name, email,
       count(*) OVER (PARTITION BY email) AS cnt
FROM people;

id|name     |email    |cnt|
--|---------|---------|---|
 2|Li Si    |lisi@    |  2|
 4|Lis      |lisi@    |  2|
 3|Wang Wu  |wangwu@  |  3|
 5|Wang Wu  |wangwu@  |  3|
 6|Wang Wu  |wangwu@  |  3|
 1|Zhang San|zhangsan@|  1|
The window function not only counts the occurrences of each value but also retains the original rows, avoiding a second scan. Next, we only need to return the records whose count is greater than 1:
WITH d AS (
  SELECT id, name, email,
         count(*) OVER (PARTITION BY email) AS cnt
  FROM people)
SELECT *
FROM d
WHERE cnt > 1
ORDER BY id;

id|name   |email  |cnt|
--|-------|-------|---|
 2|Li Si  |lisi@  |  2|
 3|Wang Wu|wangwu@|  3|
 4|Lis    |lisi@  |  2|
 5|Wang Wu|wangwu@|  3|
 6|Wang Wu|wangwu@|  3|
Duplicate records based on multiple fields
Window functions also support partitioning by multiple fields. The following statement finds rows in which both the name and email fields are duplicated:
WITH d AS (
  SELECT id, name, email,
         count(*) OVER (PARTITION BY name, email) AS cnt
  FROM people)
SELECT *
FROM d
WHERE cnt > 1
ORDER BY id;

id|name   |email  |cnt|
--|-------|-------|---|
 3|Wang Wu|wangwu@|  3|
 5|Wang Wu|wangwu@|  3|
 6|Wang Wu|wangwu@|  3|
Clearly, the window-function approach not only performs better but is also more readable.
Delete duplicate records
The next problem to solve is how to delete these duplicate records. Usually we need to keep one row from each group of duplicates.
Use a subquery to delete duplicate records
If we want to delete the duplicate records in the email field and keep only the row with the smallest id in each group, we can use a subquery to find the rows to retain and then delete all the others:
DELETE FROM people
WHERE id NOT IN (
  SELECT min(id)
  FROM people
  GROUP BY email
);
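As a PostgreSQL-specific alternative not shown in the original text, the DISTINCT ON clause can pick one representative row per email in a single pass; sorting by email, id makes that representative the row with the smallest id. A sketch:

```sql
-- Keep the first row per email (smallest id, thanks to ORDER BY email, id),
-- delete everything else.
DELETE FROM people
WHERE id NOT IN (
  SELECT DISTINCT ON (email) id
  FROM people
  ORDER BY email, id
);
```

DISTINCT ON is not standard SQL, so this variant is not portable to other databases.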
After deletion, query the people table again:
SELECT * FROM people;

id|name     |email    |
--|---------|---------|
 1|Zhang San|zhangsan@|
 2|Li Si    |lisi@    |
 3|Wang Wu  |wangwu@  |
You can also use a cross-table delete (DELETE ... USING) or a correlated subquery to delete the duplicate data:
DELETE FROM people p
USING people d
WHERE p.email = d.email
  AND p.id > d.id;

DELETE FROM people p
WHERE p.id NOT IN (
  SELECT min(id)
  FROM people
  WHERE email = p.email);
Before executing the above statement, remember to recreate the people table and generate the test data.
Think about it: how would you keep the row with the largest id in each group of duplicates instead?
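One possible answer, sketched here by swapping the aggregate in the earlier subquery (this assumes the people table has been recreated with the original test data):

```sql
-- Keep the largest id per email instead of the smallest:
DELETE FROM people
WHERE id NOT IN (
  SELECT max(id)   -- min(id) kept the smallest; max(id) keeps the largest
  FROM people
  GROUP BY email
);
```

The same swap works for the correlated-subquery variant: replace min(id) with max(id).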
Use a window function to delete duplicate records
The ROW_NUMBER() window function can be used to group the data and then assign a unique sequential number to each row within its group. For example:
SELECT id, name, email,
       row_number() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM people;

id|name     |email    |row_num|
--|---------|---------|-------|
 2|Li Si    |lisi@    |      1|
 4|Lis      |lisi@    |      2|
 3|Wang Wu  |wangwu@  |      1|
 5|Wang Wu  |wangwu@  |      2|
 6|Wang Wu  |wangwu@  |      3|
 1|Zhang San|zhangsan@|      1|
The statement groups the data by email (PARTITION BY email), sorts each group by id (ORDER BY id), and then numbers the rows within each group; a number greater than 1 indicates a duplicate row.
In addition to ROW_NUMBER(), the RANK() or DENSE_RANK() functions can achieve the same effect here. For an introduction to window functions and more use cases, you can refer to this article.
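A sketch comparing the three ranking functions on the sample table (a named WINDOW clause avoids repeating the window definition):

```sql
SELECT id, email,
       row_number() OVER w AS row_num,
       rank()       OVER w AS rnk,
       dense_rank() OVER w AS dense_rnk
FROM people
WINDOW w AS (PARTITION BY email ORDER BY id);
```

Because id is unique within each partition, there are no ties and all three functions return the same numbers here; they differ only when the ORDER BY column contains duplicates.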
We can delete the duplicate records based on this query result (note that ORDER BY id DESC below keeps the row with the largest id in each group):
DELETE FROM people
WHERE id IN (
  SELECT id
  FROM (
    SELECT id, name, email,
           row_number() OVER (PARTITION BY email ORDER BY id DESC) AS row_num
    FROM people) d
  WHERE row_num > 1);
Before executing the above statement, remember to recreate the people table and generate the test data.
Deduplication based on multiple fields works the same way as the single-field case; you can try it yourself. Finally, to prevent duplicate data from being generated again, a unique constraint can be added. For example:
ALTER TABLE people ADD CONSTRAINT uk_people_email UNIQUE (email);
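The multi-field deduplication suggested above can be sketched like this (assuming the people table has been recreated with the original test data; this keeps the smallest id per name/email pair):

```sql
-- Partition by both columns so only rows duplicated in
-- name AND email receive row_num > 1.
DELETE FROM people
WHERE id IN (
  SELECT id
  FROM (
    SELECT id,
           row_number() OVER (PARTITION BY name, email ORDER BY id) AS row_num
    FROM people) d
  WHERE row_num > 1);
```

A matching multi-column constraint would then be `UNIQUE (name, email)` rather than `UNIQUE (email)`.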
Summary
This article introduced how to find and delete duplicate records in PostgreSQL: using aggregate functions together with subqueries or join queries, and using window functions, to find duplicate data in a single field or across multiple fields; and using subqueries and window functions to delete the duplicates.