Finding and deleting duplicate rows in a PostgreSQL table
Create a test table and insert data
create table aaa(id bigserial,col1 varchar(255));
insert into aaa values(1,'b'),(2,'a'),(3,'b'),(4,'c');
select * from aaa;
Find and delete the duplicate rows
Method 1: ctid represents the physical location of a row within its table. A ctid consists of two numbers: the first is the physical block (page) number, and the second is the row offset within that block.
select * from aaa where ctid not in(select max(ctid) from aaa group by col1);
Delete the duplicate rows
delete from aaa where ctid not in(select max(ctid) from aaa group by col1);
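The ctid-based pattern above is PostgreSQL-specific, but the same idea of deduplicating by a hidden physical row identifier can be sketched with SQLite's analogous `rowid` (an illustration under that assumption; `ctid` itself exists only in PostgreSQL):

```python
import sqlite3

# In-memory SQLite database; rowid plays a role similar to PostgreSQL's ctid.
conn = sqlite3.connect(":memory:")
conn.execute("create table aaa(id integer, col1 text)")
conn.executemany("insert into aaa values (?, ?)",
                 [(1, "b"), (2, "a"), (3, "b"), (4, "c")])

# Keep the row with the largest rowid per col1 value and delete the rest,
# mirroring: delete from aaa where ctid not in (select max(ctid) ... group by col1)
conn.execute("""
    delete from aaa
    where rowid not in (select max(rowid) from aaa group by col1)
""")

rows = sorted(conn.execute("select id, col1 from aaa").fetchall())
print(rows)  # one surviving row per distinct col1 value
```

After the delete, only one row remains for each distinct `col1` value, exactly as in the PostgreSQL version.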
Method 2: Use exists
Find the duplicate rows
select * from aaa t1 where exists (select 1 from aaa t2 where t1.col1 = t2.col1 and t2.id < t1.id); -- exists matches rows whose col1 appears on another row with a smaller id, i.e. every duplicate except the one with the smallest id
Delete the duplicate rows
delete from aaa t1 where exists (select 1 from aaa t2 where t1.col1 = t2.col1 and t2.id < t1.id);
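The same EXISTS self-join can be run against SQLite to check the logic (a sketch, again using SQLite purely because it is easy to execute standalone; the SQL is otherwise the same as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table aaa(id integer, col1 text)")
conn.executemany("insert into aaa values (?, ?)",
                 [(1, "b"), (2, "a"), (3, "b"), (4, "c")])

# Delete every row for which another row with the same col1 and a smaller id
# exists, so only the smallest-id row of each duplicate group survives.
conn.execute("""
    delete from aaa
    where exists (select 1 from aaa t2
                  where t2.col1 = aaa.col1 and t2.id < aaa.id)
""")

rows = sorted(conn.execute("select id, col1 from aaa").fetchall())
print(rows)
```

Here row (3, 'b') is deleted because (1, 'b') has the same col1 and a smaller id.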
Commonly used methods for deleting duplicate data in PostgreSQL
The most efficient method
Verified in a test environment: on a large table of 66 million rows, 22 million duplicate rows were deleted in just 3 minutes.
delete from deltest a where a.ctid = any(array (select ctid from (select row_number() over (partition by id), ctid from deltest) t where t.row_number > 1));
Three ways to delete duplicate data in PG
First, create a base table and insert a certain amount of duplicate data.
create table deltest(id int, name varchar(255));
create table deltest_bk (like deltest);
insert into deltest select generate_series(1, 10000), 'ZhangSan';
insert into deltest select generate_series(1, 10000), 'ZhangSan';
insert into deltest_bk select * from deltest;
1. General deletion method
The most obvious approach is to check, for each row, whether it is a duplicate; among duplicates, keep only the row with the smallest (or largest) ctid and delete the rest.
explain analyse delete from deltest a where a.ctid <> (select min(t.ctid) from deltest t where t.id = a.id);
-------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=0.00..195616.30 rows=1518 width=6) (actual time=67758.866..67758.866 rows=0 loops=1)
   ->  Seq Scan on deltest a  (cost=0.00..195616.30 rows=1518 width=6) (actual time=32896.517..67663.228 rows=10000 loops=1)
         Filter: (ctid <> (SubPlan 1))
         Rows Removed by Filter: 10000
         SubPlan 1
           ->  Aggregate  (cost=128.10..128.10 rows=1 width=6) (actual time=3.374..3.374 rows=1 loops=20000)
                 ->  Seq Scan on deltest t  (cost=0.00..128.07 rows=8 width=6) (actual time=0.831..3.344 rows=2 loops=20000)
                       Filter: (id = a.id)
                       Rows Removed by Filter: 19998
 Total runtime: 67758.931 ms

select count(*) from deltest;
 count
-------
 10000
You can see that for each id, the row with the smallest ctid is kept and the others are deleted, which removes half of the rows in deltest. It takes more than 67 seconds. Quite slow.
2. group by deletion method
The group by method finds the smallest ctid in each group, then deletes all the other rows.
explain analyse delete from deltest a where a.ctid not in (select min(ctid) from deltest group by id);
-------------------------------------------------------------------------------------------
 Delete on deltest a  (cost=131.89..2930.46 rows=763 width=6) (actual time=30942.496..30942.496 rows=0 loops=1)
   ->  Seq Scan on deltest a  (cost=131.89..2930.46 rows=763 width=6) (actual time=10186.296..30814.366 rows=10000 loops=1)
         Filter: (NOT (SubPlan 1))
         Rows Removed by Filter: 10000
         SubPlan 1
           ->  Materialize  (cost=131.89..134.89 rows=200 width=10) (actual time=0.001..0.471 rows=7500 loops=20000)
                 ->  HashAggregate  (cost=131.89..133.89 rows=200 width=10) (actual time=10.568..13.584 rows=10000 loops=1)
                       ->  Seq Scan on deltest  (cost=0.00..124.26 rows=1526 width=10) (actual time=0.006..3.829 rows=20000 loops=1)
 Total runtime: 30942.819 ms

select count(*) from deltest;
 count
-------
 10000
Again half of the data is deleted, and using group by cuts the time in half. But it still takes about 30 seconds. Let's try the third deletion method below.
3. Efficient deletion method
explain analyze delete from deltest a where a.ctid = any(array (select ctid from (select row_number() over (partition by id), ctid from deltest) t where t.row_number > 1));
-----------------------------------------------------------------------------------------
 Delete on deltest a  (cost=250.74..270.84 rows=10 width=6) (actual time=98.363..98.363 rows=0 loops=1)
   InitPlan 1 (returns $0)
     ->  Subquery Scan on t  (cost=204.95..250.73 rows=509 width=6) (actual time=29.446..47.867 rows=10000 loops=1)
           Filter: (t.row_number > 1)
           Rows Removed by Filter: 10000
           ->  WindowAgg  (cost=204.95..231.66 rows=1526 width=10) (actual time=29.436..44.790 rows=20000 loops=1)
                 ->  Sort  (cost=204.95..208.77 rows=1526 width=10) (actual time=12.466..13.754 rows=20000 loops=1)
                       Sort Key: deltest.id
                       Sort Method: quicksort  Memory: 1294kB
                       ->  Seq Scan on deltest  (cost=0.00..124.26 rows=1526 width=10) (actual time=0.021..5.110 rows=20000 loops=1)
   ->  Tid Scan on deltest a  (cost=0.01..20.11 rows=10 width=6) (actual time=82.983..88.751 rows=10000 loops=1)
         TID Cond: (ctid = ANY ($0))
 Total runtime: 98.912 ms

select count(*) from deltest;
 count
-------
 10000
You can see that it takes only about 98 ms.
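The row_number-over-partition technique can also be verified end to end in SQLite (a runnable sketch assuming a SQLite build with window-function support, 3.25+; `rowid` stands in for `ctid`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table deltest(id integer, name text)")
# Insert ids 1..100 twice, so every id is duplicated once.
conn.executemany("insert into deltest values (?, ?)",
                 [(i, "ZhangSan") for i in range(1, 101)] * 2)

# Rank the rows within each id group and delete everything ranked past 1,
# addressing the victims by rowid (SQLite's analog of ctid).
conn.execute("""
    delete from deltest
    where rowid in (select rid
                    from (select rowid as rid,
                                 row_number() over (partition by id) as rn
                          from deltest)
                    where rn > 1)
""")

count = conn.execute("select count(*) from deltest").fetchone()[0]
print(count)  # 100
```

Only one row per id survives. The key design point carries over to PostgreSQL: the window scan enumerates duplicates in a single pass, and the outer delete then jumps straight to each victim by its physical row identifier instead of re-scanning the table per row.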
Summary
The above is based on personal experience. I hope it serves as a useful reference, and I appreciate your support.