SoFunction
Updated on 2025-03-03

Description of usage of Postgresql deduplication function distinct

In projects, we often deduplicate the data, and sometimes use the in or EXISTS function. Or you can also use group by to check the diversion

However, Postgresql also has its own deduplication function: distinct

Here is an example of distinct:

1. Create table: user

CREATE TABLE `user` (
 `name` varchar(30) DEFAULT NULL,
 `age` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `user` VALUES ('Zhang San', 20);
INSERT INTO `user` VALUES ('Li Si', 22);
INSERT INTO `user` VALUES ('Li Si', 20);
INSERT INTO `user` VALUES ('Zhang San', 22);
INSERT INTO `user` VALUES ('Zhang San', 20);

Query results:

SELECT * FROM user
Zhang San 20
Li Si 22
Li Si 20
Zhang San 22
Zhang San 20

2. Query the data after deduplication based on name:

SELECT distinct name FROM user
Zhang San
Li Si

3. Query the data after deduplication based on name and age:

SELECT distinct name,age FROM user
Zhang San 20
Li Si 22
Li Si 20
Zhang San 22

4. Query the number of duplicate data based on name and age:

SELECT distinct name,age,count(*) Number of data strips FROM user GROUP BY name,age
Zhang San 20 2
Zhang San 22 1
Li Si 20 1
Li Si 22 1

2. After finding duplicate data, we need to delete duplicate data.

There are several ways to delete duplicate data. Generally, temporary tables are used or deduplication is used according to a certain field, such as id, etc., through the max or min function.

Supplement: Implement differential synchronization of data based on postgresql ctid

Project background:

Recently, we have been doing heterogeneous data synchronization (non-real-time) work, syncing from oracle, gbase, postgresql to mysql database. For business tables without self-increment fields (self-increment ID or time fields), differential synchronization is a very troublesome thing, mainly reflected in the addition, update and deletion of records.

Note: The source library only provides one read-only user

The role of ctid in pg

ctid is a tuple identifier used to point to itself or new tuple. How to understand it? Here are some experiments to test

satdb=# create table test_ctid(id int,name varchar(100));
satdb=# insert into test_ctid values(1,‘a'),(1,‘a');
satdb=# insert into test_ctid values(2,‘a'),(3,‘a');

View the record's ctid value

satdb=# select id,name,ctid from test_ctid;
id | name | ctid
----±-----±------
1 | a | (0,1)
1 | a | (0,2)
2 | a | (0,3)
3 | a | (0,4)
(4 rows)

Update records with id 2

satdb=# update test_ctid set name=‘b' where id=2;
UPDATE 1

Here you can see that the record with id=2 points to the new tuple identifier (0,5)

satdb=# select id,name,ctid from test_ctid;
id | name | ctid
----±-----±------
1 | a | (0,1)
1 | a | (0,2)
3 | a | (0,4)
2 | b | (0,5)
(4 rows)
satdb=# select * from test_ctid where ctid='(0,1)';
id | name
----±-----
1 | a
(1 row)

After deleting the record with id=3, the corresponding ctid(0,4) no longer exists.

satdb=# delete from test_ctid where id=3;
DELETE 1
satdb=# select *,ctid from test_ctid;
id | name | ctid
----±-----±------
1 | a | (0,1)
1 | a | (0,2)
2 | b | (0,5)
(3 rows)

When inserting another record, see if the identifier (0,4) will be used.

satdb=# insert into test_ctid values(3,‘d');
INSERT 0 1
satdb=# select *,ctid from test_ctid;
id | name | ctid
----±-----±------
1 | a | (0,1)
1 | a | (0,2)
2 | b | (0,5)
3 | d | (0,6)

Here, the newly inserted record will not be used (0,4), but will be directly assigned a new identifier (0,6)

Summarize:

1. The function of ctid is similar to oracle rowid, which can uniquely identify a record

2. After the record is updated, a new ctid will be produced.

3. After the record is deleted, the newly inserted record will not use the ctid that has been deleted.

4. Record deduplication operation can be realized based on ctid

5. Differential incremental synchronization can be achieved based on ctid (add, delete, update)

The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.