While working with PostgreSQL I have collected three ways of deleting duplicate data. The most obvious is the common deletion method; it is easy to implement, but its performance is poor: deleting takes a long time and slows down concurrent writes.
The group by deletion method is considerably more efficient.
The third is one I discovered only recently and had not verified until now. Below I summarize all three deletion methods and verify their execution efficiency.
First create a basic table and insert a certain amount of duplicate data.
test=# create table deltest(id int, name varchar(255));
CREATE TABLE
test=# create table deltest_bk (like deltest);
CREATE TABLE
test=# insert into deltest select generate_series(1, 10000), 'ZhangSan';
INSERT 0 10000
test=# insert into deltest select generate_series(1, 10000), 'ZhangSan';
INSERT 0 10000
test=# insert into deltest_bk select * from deltest;
INSERT 0 20000
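As a quick sanity check (this query is my own addition, not part of the original session), you can confirm that every id now appears exactly twice in deltest:

-- Each of the 10000 ids was inserted twice, so every group should have count 2.
select id, count(*) from deltest group by id having count(*) > 1 order by id limit 3;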
General deletion method
The easiest method to think of is to check whether each row is duplicated and, among the duplicates, keep only the row with the smallest (or largest) ctid while deleting the rest.
test=# explain analyse delete from deltest a where a.ctid <> (select min(t.ctid) from deltest t where a.id = t.id);
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=0.00..195616.30 rows=1518 width=6) (actual time=67758.866..67758.866 rows=0 loops=1)
   ->  Seq Scan on deltest a  (cost=0.00..195616.30 rows=1518 width=6) (actual time=32896.517..67663.228 rows=10000 loops=1)
         Filter: (ctid <> (SubPlan 1))
         Rows Removed by Filter: 10000
         SubPlan 1
           ->  Aggregate  (cost=128.10..128.10 rows=1 width=6) (actual time=3.374..3.374 rows=1 loops=20000)
                 ->  Seq Scan on deltest t  (cost=0.00..128.07 rows=8 width=6) (actual time=0.831..3.344 rows=2 loops=20000)
                       Filter: (a.id = id)
                       Rows Removed by Filter: 19998
 Total runtime: 67758.931 ms
test=# select count(*) from deltest;
 count
-------
 10000
(1 row)
You can see that for each id only the row with the smallest ctid is kept and the rest are deleted, which removes half of the data in the deltest table, and it takes more than 67 seconds. Quite slow. The plan shows why: the correlated subquery is re-executed for every single row (loops=20000), scanning the table each time.
group by deletion method
The second method is the group by method, which finds the smallest ctid for each id through grouping and then deletes all the other rows.
test=# truncate table deltest;
TRUNCATE TABLE
test=# insert into deltest select * from deltest_bk;
INSERT 0 20000
test=# explain analyse delete from deltest a where a.ctid not in (select min(ctid) from deltest group by id);
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=131.89..2930.46 rows=763 width=6) (actual time=30942.496..30942.496 rows=0 loops=1)
   ->  Seq Scan on deltest a  (cost=131.89..2930.46 rows=763 width=6) (actual time=10186.296..30814.366 rows=10000 loops=1)
         Filter: (NOT (SubPlan 1))
         Rows Removed by Filter: 10000
         SubPlan 1
           ->  Materialize  (cost=131.89..134.89 rows=200 width=10) (actual time=0.001..0.471 rows=7500 loops=20000)
                 ->  HashAggregate  (cost=131.89..133.89 rows=200 width=10) (actual time=10.568..13.584 rows=10000 loops=1)
                       ->  Seq Scan on deltest  (cost=0.00..124.26 rows=1526 width=10) (actual time=0.006..3.829 rows=20000 loops=1)
 Total runtime: 30942.819 ms
(9 rows)
test=# select count(*) from deltest;
 count
-------
 10000
(1 row)
You can see that half of the data is deleted here as well; the group by approach cuts the runtime roughly in half, but it still needs about 30 seconds. Let's try the third deletion method below.
New deletion method
In the book Postgres Cultivation Way, the author mentions a more efficient deletion method. Let's verify it here, as follows:
test=# truncate table deltest;
TRUNCATE TABLE
test=# insert into deltest select * from deltest_bk;
INSERT 0 20000
test=# explain analyze delete from deltest a where a.ctid = any(array (select ctid from (select row_number() over (partition by id), ctid from deltest) t where t.row_number > 1));
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=250.74..270.84 rows=10 width=6) (actual time=98.363..98.363 rows=0 loops=1)
   InitPlan 1 (returns $0)
     ->  Subquery Scan on t  (cost=204.95..250.73 rows=509 width=6) (actual time=29.446..47.867 rows=10000 loops=1)
           Filter: (t.row_number > 1)
           Rows Removed by Filter: 10000
           ->  WindowAgg  (cost=204.95..231.66 rows=1526 width=10) (actual time=29.436..44.790 rows=20000 loops=1)
                 ->  Sort  (cost=204.95..208.77 rows=1526 width=10) (actual time=12.466..13.754 rows=20000 loops=1)
                       Sort Key: deltest.id
                       Sort Method: quicksort  Memory: 1294kB
                       ->  Seq Scan on deltest  (cost=0.00..124.26 rows=1526 width=10) (actual time=0.021..5.110 rows=20000 loops=1)
   ->  Tid Scan on deltest a  (cost=0.01..20.11 rows=10 width=6) (actual time=82.983..88.751 rows=10000 loops=1)
         TID Cond: (ctid = ANY ($0))
 Total runtime: 98.912 ms
(13 rows)
test=# select count(*) from deltest;
 count
-------
 10000
(1 row)
The result above really surprised me; this is the first time I have seen such a fast deletion. The trick is that the ctids of the duplicate rows are collected once into an array by the row_number() window function (InitPlan 1), and the delete then runs as a Tid Scan over that array instead of executing a subquery for every row, so the whole thing finishes in under 100 ms. I am humbled, and can only admire the author of Postgres Cultivation Way.
Supplement: deleting duplicate data from a PostgreSQL table while keeping one row
1. Add a column rownum of type serial to the table (table name: table, duplicate key: id); see the sketch after this list.
2. Execute the statement:
delete from table where rownum not in (select max(rownum) from table group by id);
3. Finally, drop the rownum column.
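Put together, the three steps look like the following sketch. Here table is only the placeholder name from the steps above (it has to be double-quoted because table is a reserved word in PostgreSQL), and id is the column that defines a duplicate:

-- Step 1: add a serial column; every existing row receives a distinct number.
alter table "table" add column rownum serial;

-- Step 2: for every id, keep only the row with the largest rownum.
delete from "table"
 where rownum not in (select max(rownum) from "table" group by id);

-- Step 3: remove the helper column again.
alter table "table" drop column rownum;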
The above is based on my personal experience; I hope it can serve as a useful reference, and I hope for your continued support. If there are any mistakes or points I have not fully considered, corrections are welcome.