In projects, we often deduplicate the data, and sometimes use the in or EXISTS function. Or you can also use group by to check the diversion
However, Postgresql also has its own deduplication function: distinct
Here is an example of distinct:
1. Create table: user
CREATE TABLE `user` ( `name` varchar(30) DEFAULT NULL, `age` int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT INTO `user` VALUES ('Zhang San', 20); INSERT INTO `user` VALUES ('Li Si', 22); INSERT INTO `user` VALUES ('Li Si', 20); INSERT INTO `user` VALUES ('Zhang San', 22); INSERT INTO `user` VALUES ('Zhang San', 20);
Query results:
SELECT * FROM user Zhang San 20 Li Si 22 Li Si 20 Zhang San 22 Zhang San 20
2. Query the data after deduplication based on name:
SELECT distinct name FROM user Zhang San Li Si
3. Query the data after deduplication based on name and age:
SELECT distinct name,age FROM user Zhang San 20 Li Si 22 Li Si 20 Zhang San 22
4. Query the number of duplicate data based on name and age:
SELECT distinct name,age,count(*) Number of data strips FROM user GROUP BY name,age Zhang San 20 2 Zhang San 22 1 Li Si 20 1 Li Si 22 1
2. After finding duplicate data, we need to delete duplicate data.
There are several ways to delete duplicate data. Generally, temporary tables are used or deduplication is used according to a certain field, such as id, etc., through the max or min function.
Supplement: Implement differential synchronization of data based on postgresql ctid
Project background:
Recently, we have been doing heterogeneous data synchronization (non-real-time) work, syncing from oracle, gbase, postgresql to mysql database. For business tables without self-increment fields (self-increment ID or time fields), differential synchronization is a very troublesome thing, mainly reflected in the addition, update and deletion of records.
Note: The source library only provides one read-only user
The role of ctid in pg
ctid is a tuple identifier used to point to itself or new tuple. How to understand it? Here are some experiments to test
satdb=# create table test_ctid(id int,name varchar(100)); satdb=# insert into test_ctid values(1,‘a'),(1,‘a'); satdb=# insert into test_ctid values(2,‘a'),(3,‘a');
View the record's ctid value
satdb=# select id,name,ctid from test_ctid; id | name | ctid ----±-----±------ 1 | a | (0,1) 1 | a | (0,2) 2 | a | (0,3) 3 | a | (0,4) (4 rows)
Update records with id 2
satdb=# update test_ctid set name=‘b' where id=2; UPDATE 1
Here you can see that the record with id=2 points to the new tuple identifier (0,5)
satdb=# select id,name,ctid from test_ctid; id | name | ctid ----±-----±------ 1 | a | (0,1) 1 | a | (0,2) 3 | a | (0,4) 2 | b | (0,5) (4 rows) satdb=# select * from test_ctid where ctid='(0,1)'; id | name ----±----- 1 | a (1 row)
After deleting the record with id=3, the corresponding ctid(0,4) no longer exists.
satdb=# delete from test_ctid where id=3; DELETE 1 satdb=# select *,ctid from test_ctid; id | name | ctid ----±-----±------ 1 | a | (0,1) 1 | a | (0,2) 2 | b | (0,5) (3 rows)
When inserting another record, see if the identifier (0,4) will be used.
satdb=# insert into test_ctid values(3,‘d'); INSERT 0 1 satdb=# select *,ctid from test_ctid; id | name | ctid ----±-----±------ 1 | a | (0,1) 1 | a | (0,2) 2 | b | (0,5) 3 | d | (0,6)
Here, the newly inserted record will not be used (0,4), but will be directly assigned a new identifier (0,6)
Summarize:
1. The function of ctid is similar to oracle rowid, which can uniquely identify a record
2. After the record is updated, a new ctid will be produced.
3. After the record is deleted, the newly inserted record will not use the ctid that has been deleted.
4. Record deduplication operation can be realized based on ctid
5. Differential incremental synchronization can be achieved based on ctid (add, delete, update)
The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.