Finding and deleting duplicate rows in a PostgreSQL table
Create a test table and insert data
create table aaa(id bigserial,col1 varchar(255));
insert into aaa values(1,'b'),(2,'a'),(3,'b'),(4,'c');
select * from aaa;
Find and delete the duplicate rows
Method 1: ctid represents the physical location of a row within its table. A ctid consists of two numbers: the first is the physical block (page) number, and the second is the row offset within that block.
select * from aaa where ctid not in(select max(ctid) from aaa group by col1);
Delete the duplicate rows
delete from aaa where ctid not in(select max(ctid) from aaa group by col1);
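The ctid-based pattern above is PostgreSQL-specific, but the same idea of deduplicating by a hidden physical row identifier can be sketched with SQLite's analogous `rowid` (an illustration under that assumption; `ctid` itself exists only in PostgreSQL):

```python
import sqlite3

# In-memory SQLite database; rowid plays a role similar to PostgreSQL's ctid.
conn = sqlite3.connect(":memory:")
conn.execute("create table aaa(id integer, col1 text)")
conn.executemany("insert into aaa values (?, ?)",
                 [(1, "b"), (2, "a"), (3, "b"), (4, "c")])

# Keep the row with the largest rowid per col1 value and delete the rest,
# mirroring: delete from aaa where ctid not in (select max(ctid) ... group by col1)
conn.execute("""
    delete from aaa
    where rowid not in (select max(rowid) from aaa group by col1)
""")

rows = sorted(conn.execute("select id, col1 from aaa").fetchall())
print(rows)  # one surviving row per distinct col1 value
```

After the delete, only one row remains for each distinct `col1` value, exactly as in the PostgreSQL version.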
Method 2: Use exists
Find the duplicate rows
select * from aaa t1 where exists (select 1 from aaa t2 where t1.col1 = t2.col1 and t2.id < t1.id); -- exists matches rows whose col1 appears on another row with a smaller id, i.e. every duplicate except the one with the smallest id
Delete the duplicate rows
delete from aaa t1 where exists (select 1 from aaa t2 where t1.col1 = t2.col1 and t2.id < t1.id);
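The same EXISTS self-join can be run against SQLite to check the logic (a sketch, again using SQLite purely because it is easy to execute standalone; the SQL is otherwise the same as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table aaa(id integer, col1 text)")
conn.executemany("insert into aaa values (?, ?)",
                 [(1, "b"), (2, "a"), (3, "b"), (4, "c")])

# Delete every row for which another row with the same col1 and a smaller id
# exists, so only the smallest-id row of each duplicate group survives.
conn.execute("""
    delete from aaa
    where exists (select 1 from aaa t2
                  where t2.col1 = aaa.col1 and t2.id < aaa.id)
""")

rows = sorted(conn.execute("select id, col1 from aaa").fetchall())
print(rows)
```

Here row (3, 'b') is deleted because (1, 'b') has the same col1 and a smaller id.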
Commonly used methods for deleting duplicate data in PostgreSQL
The most efficient method
Verified in a test environment: on a large table of 66 million rows, 22 million duplicate rows were deleted in just 3 minutes.
delete from deltest a where a.ctid = any(array (select ctid from (select row_number() over (partition by id), ctid from deltest) t where t.row_number > 1));
Three ways to delete duplicate data in PG
First, create a base table and insert a certain amount of duplicate data.
create table deltest(id int, name varchar(255));
create table deltest_bk (like deltest);
insert into deltest select generate_series(1, 10000), 'ZhangSan';
insert into deltest select generate_series(1, 10000), 'ZhangSan';
insert into deltest_bk select * from deltest;
1. General deletion method
The most obvious approach is to check, for each row, whether it is a duplicate; among duplicates, keep only the row with the smallest (or largest) ctid and delete the rest.
explain analyse delete from deltest a where a.ctid <> (select min(t.ctid) from deltest t where t.id = a.id);
-------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=0.00..195616.30 rows=1518 width=6) (actual time=67758.866..67758.866 rows=0 loops=1)
   ->  Seq Scan on deltest a  (cost=0.00..195616.30 rows=1518 width=6) (actual time=32896.517..67663.228 rows=10000 loops=1)
         Filter: (ctid <> (SubPlan 1))
         Rows Removed by Filter: 10000
         SubPlan 1
           ->  Aggregate  (cost=128.10..128.10 rows=1 width=6) (actual time=3.374..3.374 rows=1 loops=20000)
                 ->  Seq Scan on deltest t  (cost=0.00..128.07 rows=8 width=6) (actual time=0.831..3.344 rows=2 loops=20000)
                       Filter: (id = a.id)
                       Rows Removed by Filter: 19998
 Total runtime: 67758.931 ms

select count(*) from deltest;
 count
-------
 10000
You can see that for each id, the row with the smallest ctid is kept and the others are deleted, which removes half of the rows in deltest. It takes more than 67 seconds. Quite slow.
2. group by deletion method
The group by method finds the smallest ctid in each group, then deletes all the other rows.
explain analyse delete from deltest a where a.ctid not in (select min(ctid) from deltest group by id);
-------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=131.89..2930.46 rows=763 width=6) (actual time=30942.496..30942.496 rows=0 loops=1)
   ->  Seq Scan on deltest a  (cost=131.89..2930.46 rows=763 width=6) (actual time=10186.296..30814.366 rows=10000 loops=1)
         Filter: (NOT (SubPlan 1))
         Rows Removed by Filter: 10000
         SubPlan 1
           ->  Materialize  (cost=131.89..134.89 rows=200 width=10) (actual time=0.001..0.471 rows=7500 loops=20000)
                 ->  HashAggregate  (cost=131.89..133.89 rows=200 width=10) (actual time=10.568..13.584 rows=10000 loops=1)
                       ->  Seq Scan on deltest  (cost=0.00..124.26 rows=1526 width=10) (actual time=0.006..3.829 rows=20000 loops=1)
 Total runtime: 30942.819 ms

select count(*) from deltest;
 count
-------
 10000
Again half of the data is deleted, and using group by cuts the time in half. But it still takes about 30 seconds. Let's try the third deletion method below.
3. Efficient deletion method
explain analyze delete from deltest a where a.ctid = any(array (select ctid from (select row_number() over (partition by id), ctid from deltest) t where t.row_number > 1));
-----------------------------------------------------------------------------------------
 Delete on deltest a  (cost=250.74..270.84 rows=10 width=6) (actual time=98.363..98.363 rows=0 loops=1)
   InitPlan 1 (returns $0)
     ->  Subquery Scan on t  (cost=204.95..250.73 rows=509 width=6) (actual time=29.446..47.867 rows=10000 loops=1)
           Filter: (t.row_number > 1)
           Rows Removed by Filter: 10000
           ->  WindowAgg  (cost=204.95..231.66 rows=1526 width=10) (actual time=29.436..44.790 rows=20000 loops=1)
                 ->  Sort  (cost=204.95..208.77 rows=1526 width=10) (actual time=12.466..13.754 rows=20000 loops=1)
                       Sort Key: deltest.id
                       Sort Method: quicksort  Memory: 1294kB
                       ->  Seq Scan on deltest  (cost=0.00..124.26 rows=1526 width=10) (actual time=0.021..5.110 rows=20000 loops=1)
   ->  Tid Scan on deltest a  (cost=0.01..20.11 rows=10 width=6) (actual time=82.983..88.751 rows=10000 loops=1)
         TID Cond: (ctid = ANY ($0))
 Total runtime: 98.912 ms

select count(*) from deltest;
 count
-------
 10000
You can see that it takes only about 98 ms.
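The row_number-over-partition technique can also be verified end to end in SQLite (a runnable sketch assuming a SQLite build with window-function support, 3.25+; `rowid` stands in for `ctid`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table deltest(id integer, name text)")
# Insert ids 1..100 twice, so every id is duplicated once.
conn.executemany("insert into deltest values (?, ?)",
                 [(i, "ZhangSan") for i in range(1, 101)] * 2)

# Rank the rows within each id group and delete everything ranked past 1,
# addressing the victims by rowid (SQLite's analog of ctid).
conn.execute("""
    delete from deltest
    where rowid in (select rid
                    from (select rowid as rid,
                                 row_number() over (partition by id) as rn
                          from deltest)
                    where rn > 1)
""")

count = conn.execute("select count(*) from deltest").fetchone()[0]
print(count)  # 100
```

Only one row per id survives. The key design point carries over to PostgreSQL: the window scan enumerates duplicates in a single pass, and the outer delete then jumps straight to each victim by its physical row identifier instead of re-scanning the table per row.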
Summary
The above is based on personal experience. I hope it serves as a useful reference, and I appreciate your support.