In-depth understanding of postgresql paging data duplication problem

Problem background

Many developers and testers may have encountered the data of the list when turning the next page, and the data from the previous page will be displayed, that is, there will be duplicate data when turning the page.

How to deal with it?

The reason for this problem is that the selected sorting fields are duplicated. The common way to deal with it is to add a unique field when sorting, so that the data will not be repeated during the pagination process. The documentation also explains this issue, which is not a bug. Instead, when sorting, you need to select a unique field to sort, otherwise the returned result will not be sure.

What is the root cause of the duplication of sorting data?

Students who often optimize SQL may find that the execution plan contains the keyword Sort Method, and this keyword is the method of sorting and selection. Abase sorting is divided into three types

quicksort �
top-N heapsort Memory
external merge Disk �

Speculation

The problem of duplication of paging is related to the stability of the execution plan selection sorting algorithm.

A brief introduction to the scenarios of these three sorting algorithms:

In the case of indexing: sorting can directly go to the index. In the case of no index: choose quick sort when the data volume of the table is small (the memory required for sorting must be less than work_mem), choose heap sort when the sorting has limit and the memory consumed does not exceed work_mem, and choose merge sort when work_mem is insufficient.

Verification speculation

1. Create a table and initialize the data

abase=# create table t_sort(n_int int,c_id varchar(300));
CREATE TABLE
abase=# insert into t_sort(n_int,c_id) select 100,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 200,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 300,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 400,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 500,generate_series(1,9);
INSERT 0 9
abase=# insert into t_sort(n_int,c_id) select 600,generate_series(1,9);
INSERT 0 9

Three sorts

--Quick sort quicksort
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
            QUERY PLAN            
------------------------------------------------------------
 Sort (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)
 Sort Key: n_int
 Sort Method: quicksort Memory: 27kB
 -&gt; Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)
 Planning time: 0.161 ms
 Execution time: 0.104 ms
(6 rows)
--Heap sorting top-N heapsort
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc limit 10;
             QUERY PLAN             
 
------------------------------------------------------------
 Limit (cost=2.71..2.73 rows=10 width=12) (actual time=0.066..0.068 rows=10 loops=1)
 -&gt; Sort (cost=2.71..2.84 rows=54 width=12) (actual time=0.065..0.066 rows=10 loops=1)
   Sort Key: n_int
   Sort Method: top-N heapsort Memory: 25kB
   -&gt; Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.022..0.031 rows=54 loops=1
)
 Planning time: 0.215 ms
 Execution time: 0.124 ms
(7 rows)
--Merge sort external sort Disk
--Insert a large number of values asaData of
abase=# insert into t_sort(n_int,c_id) select generate_series(1000,2000),'a';
INSERT 0 1001
abase=# set work_mem = '64kB';
SET
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
             QUERY PLAN             
-------------------------------------------------------------
 Sort (cost=18.60..19.28 rows=270 width=12) (actual time=1.235..1.386 rows=1055 loops=1)
 Sort Key: n_int
 Sort Method: external sort Disk: 32kB
 -&gt; Seq Scan on t_sort (cost=0.00..7.70 rows=270 width=12) (actual time=0.030..0.247 rows=1055 loops=1)
 Planning time: 0.198 ms
 Execution time: 1.663 ms
(6 rows)

Quick sort

--Quick sort
abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
            QUERY PLAN            
------------------------------------------------------------
 Sort (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)
 Sort Key: n_int
 Sort Method: quicksort Memory: 27kB
 -&gt; Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)
 Planning time: 0.161 ms
 Execution time: 0.104 ms
(6 rows) 

--Before obtaining20Data
 abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 20;
  ctid | n_int | c_id 
 --------+-------+------
  (0,7) | 100 | 7
  (0,2) | 100 | 2
  (0,4) | 100 | 4
  (0,8) | 100 | 8
  (0,3) | 100 | 3
  (0,6) | 100 | 6
  (0,5) | 100 | 5
  (0,9) | 100 | 9
  (0,1) | 100 | 1
  (0,14) | 200 | 5
  (0,13) | 200 | 4
  (0,12) | 200 | 3
  (0,10) | 200 | 1
  (0,15) | 200 | 6
  (0,16) | 200 | 7
  (0,17) | 200 | 8
  (0,11) | 200 | 2
  (0,18) | 200 | 9
  (0,20) | 300 | 2
  (0,19) | 300 | 1
 (20 rows)  --分页Before obtaining10Data
 abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 10 offset 0;
  ctid | n_int | c_id 
 --------+-------+------
  (0,1) | 100 | 1
  (0,3) | 100 | 3
  (0,4) | 100 | 4
  (0,2) | 100 | 2
  (0,6) | 100 | 6
  (0,7) | 100 | 7
  (0,8) | 100 | 8
  (0,9) | 100 | 9
  (0,5) | 100 | 5
  (0,10) | 200 | 1
 (10 rows)
 --Page from10Start getting10strip
 abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 10 offset 10;
  ctid | n_int | c_id 
 --------+-------+------
  (0,13) | 200 | 4
  (0,12) | 200 | 3
  (0,10) | 200 | 1
  (0,15) | 200 | 6
  (0,16) | 200 | 7
  (0,17) | 200 | 8
  (0,11) | 200 | 2
  (0,18) | 200 | 9
  (0,20) | 300 | 2
  (0,19) | 300 | 1
 (10 rows)

limit 10 offset 0,limit 10 offset 10，Get two pages of data in a row

Here you can see in the limit 10 offset 10 results, the third piece of data repeats the last item on the first page: (0,10) | 200 | 1

And n_int = 200 and c_id = 5 is "missed".

Heap sorting

abase=# select count(*) from t_sort;
 count 
-------
 1055
(1 row)
--set upwork_mem 4MB
abase=# show work_mem ;
 work_mem 
----------
 4MB
(1 row)

--top-N heapsort
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
              QUERY PLAN           
    
-------------------------------------------------------------------------------------------------------------
 Limit (cost=2061.33..2061.45 rows=10 width=13) (actual time=15.247..15.251 rows=10 loops=1)
 -&gt; Limit (cost=2061.33..2063.83 rows=1001 width=13) (actual time=15.245..15.247 rows=10 loops=1)
   -&gt; Sort (cost=2061.33..2135.72 rows=29757 width=13) (actual time=15.244..15.245 rows=10 loops=1)
    Sort Key: test.n_int
    Sort Method: top-N heapsort Memory: 95kB
    -&gt; Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.042..7.627 rows=2
9757 loops=1)
 Planning time: 0.376 ms
 Execution time: 15.415 ms
(8 rows)

--Getlimit 1001 offset 0，Then take10forward10Data
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,6) | 100 | 6
 (0,2) | 100 | 2
 (0,5) | 100 | 5
 (87,195) | 100 | 888
 (0,3) | 100 | 3
 (0,1) | 100 | 1
 (0,8) | 100 | 8
 (0,55) | 100 | 888
 (44,12) | 100 | 888
 (0,9) | 100 | 9
(10 rows)
---Getlimit 1001 offset 1，Then take10forward10Data
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (44,12) | 100 | 888
 (0,8) | 100 | 8
 (0,1) | 100 | 1
 (0,5) | 100 | 5
 (0,9) | 100 | 9
 (87,195) | 100 | 888
 (0,7) | 100 | 7
 (0,6) | 100 | 6
 (0,3) | 100 | 3
 (0,4) | 100 | 4
(10 rows)

---Getlimit 1001 offset 2，Then take10forward10Data
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,5) | 100 | 5
 (0,55) | 100 | 888
 (0,1) | 100 | 1
 (0,9) | 100 | 9
 (0,2) | 100 | 2
 (0,3) | 100 | 3
 (44,12) | 100 | 888
 (0,7) | 100 | 7
 (87,195) | 100 | 888
 (0,4) | 100 | 4
(10 rows)

Heap sorting uses memory:Sort Method: top-N heapsort Memory: 95kB

When the offset changes from 0 to 1, and when it becomes 2, it will be found that the 10 pieces of data queryed are not in sequence.

Merge sort

--Willwork_memSet as64kbLet them go and sort。
abase=# set work_mem ='64kB';
SET
abase=# show work_mem;
 work_mem 
----------
 64kB
(1 row)

-- external merge Disk
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
              QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------
 Limit (cost=2061.33..2061.45 rows=10 width=13) (actual time=27.912..27.916 rows=10 loops=1)
 -&gt; Limit (cost=2061.33..2063.83 rows=1001 width=13) (actual time=27.910..27.913 rows=10 loops=1)
   -&gt; Sort (cost=2061.33..2135.72 rows=29757 width=13) (actual time=27.909..27.911 rows=10 loops=1)
    Sort Key: test.n_int
    Sort Method: external merge Disk: 784kB
    -&gt; Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.024..6.730 rows=29757 loops=1)
 Planning time: 0.218 ms
 Execution time: 28.358 ms
(8 rows)

--Same as heap sort，Getlimit 1001 offset 0，Then take10forward10Data
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
 ctid | n_int | c_id 
--------+-------+------
 (0,1) | 100 | 1
 (0,2) | 100 | 2
 (0,4) | 100 | 4
 (0,8) | 100 | 8
 (0,9) | 100 | 9
 (0,5) | 100 | 5
 (0,3) | 100 | 3
 (0,6) | 100 | 6
 (0,55) | 100 | 888
 (0,7) | 100 | 7
(10 rows)

--Same as heap sort，Getlimit 1001 offset 1，Then take10forward10Data
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,2) | 100 | 2
 (0,4) | 100 | 4
 (0,8) | 100 | 8
 (0,9) | 100 | 9
 (0,5) | 100 | 5
 (0,3) | 100 | 3
 (0,6) | 100 | 6
 (0,55) | 100 | 888
 (0,7) | 100 | 7
 (87,195) | 100 | 888
(10 rows)

--Same as heap sort，Getlimit 1001 offset 2，Then take10forward10Data
abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;
 ctid | n_int | c_id 
----------+-------+------
 (0,4) | 100 | 4
 (0,8) | 100 | 8
 (0,9) | 100 | 9
 (0,5) | 100 | 5
 (0,3) | 100 | 3
 (0,6) | 100 | 6
 (0,55) | 100 | 888
 (0,7) | 100 | 7
 (87,195) | 100 | 888
 (44,12) | 100 | 888
(10 rows)

When reducing the work_mem using merge sorting, the offset changes from 0 to 1 and then becomes 2, and remains in order.

There is another situation, that is, there will be duplication when querying the previous few pages, but the more you look back, the more you will not repeat. It can be explained clearly now.

If 10 pieces of data per page, less memory is used when the offse is smaller. As the offse continues to grow, the more memory it consumes.

--set upwork_mem =64kb
abase=# show work_mem;
 work_mem 
----------
 64kB
(1 row)
--Querylimit 10 offset 10
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 10 offset 10) td limit 10;
              QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------
 Limit (cost=1221.42..1221.54 rows=10 width=13) (actual time=12.881..12.884 rows=10 loops=1)
 -&gt; Limit (cost=1221.42..1221.44 rows=10 width=13) (actual time=12.879..12.881 rows=10 loops=1)
   -&gt; Sort (cost=1221.39..1295.79 rows=29757 width=13) (actual time=12.877..12.879 rows=20 loops=1)
    Sort Key: test.n_int
    Sort Method: top-N heapsort Memory: 25kB
    -&gt; Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.058..6.363 rows=29757 loops=1)
 Planning time: 0.230 ms
 Execution time: 12.976 ms
(8 rows)

--Querylimit 10 offset 1000
abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 10 offset 1000) td limit 10;
              QUERY PLAN               
---------------------------------------------------------------------------------------------------------------------------
 Limit (cost=2065.75..2065.88 rows=10 width=13) (actual time=27.188..27.192 rows=10 loops=1)
 -&gt; Limit (cost=2065.75..2065.78 rows=10 width=13) (actual time=27.186..27.188 rows=10 loops=1)
   -&gt; Sort (cost=2063.25..2137.64 rows=29757 width=13) (actual time=26.940..27.138 rows=1010 loops=1)
    Sort Key: test.n_int
    Sort Method: external merge Disk: 784kB
    -&gt; Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.026..6.374 rows=29757 loops=1)
 Planning time: 0.207 ms
 Execution time: 27.718 ms
(8 rows)

You can see that when offset increases from 10 to 1000, the memory used increases, and the sorting method changes from heap sorting to merge sorting. The merge sort is a stable sort, so the next page will no longer have the data on the previous page on the next page.

References:PostgreSQL - repeating rows from LIMIT OFFSET

References:LIMIT and OFFSET

Conclusion

1. The main problem with paging duplicate data is that the sorting fields are not unique and the execution plan is done by quick sorting and heap sorting.

2. When the sorting has duplicate fields, but if the query is sorting in conjunction, there will be no problem with duplicate data.

3. When sorting with duplicate fields, the previous page is repeated. As the offset increases, the work_mem is insufficient, and then the merge sort is used, there will be no duplicate data.

4. Sorting is related to the stability of the algorithm. When different sorting algorithms are selected for the execution plan, the returned results are different.

5. A common method for processing duplicate data is that when sorting, you can add c_bh (unique field) to sort the sorting field d_larq (counter filed date).

order by d_larq,c_bh;

Summarize

The above is the entire content of this article. I hope that the content of this article has certain reference value for your study or work. Thank you for your support.