Discovering the problem
Most developers are used to ORDER BY ... OFFSET ... LIMIT for pagination. The problem with this method is that the server must scan and discard every row before the offset, so the amount of scanned data grows with the offset. When the offset is small, that is, only the first few pages are fetched, this is fine; but once the offset value gets large after many page turns, the query inevitably slows down.
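The pattern in question looks like this (a generic sketch with placeholder names; the concrete example follows below):

#A typical paginated query: to serve the 4001st page of 50 rows,
#the server scans and discards the first 200000 rows
SELECT *
FROM items           -- placeholder table name
ORDER BY created_at  -- placeholder sort column
OFFSET 200000
LIMIT 50;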
See an example:
#Create the table, insert data, and collect statistics
CREATE UNLOGGED TABLE data (
    id bigint GENERATED ALWAYS AS IDENTITY,
    value double precision NOT NULL,
    created timestamp with time zone NOT NULL
);

SELECT setseed(0.2740184);

INSERT INTO data (value, created)
SELECT random() * 1000, d
FROM generate_series(
    TIMESTAMP '2022-01-01 00:00:00 UTC',
    TIMESTAMP '2022-12-31 00:00:00 UTC',
    INTERVAL '1 second'
) AS d(d);

ALTER TABLE data ADD PRIMARY KEY (id);

VACUUM (ANALYZE) data;
Let's page through the data and see how it performs:
#This is the query we want to paginate; create the most suitable index for it
SELECT value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created;

#Create the index
CREATE INDEX data_created_value_idx ON data (created, value);

#The first page is an Index Only Scan and returns very quickly
postgres=# explain analyze SELECT value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created LIMIT 50;
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..126.51 rows=50 width=16) (actual time=0.022..0.195 rows=50 loops=1)
   ->  Index Only Scan using data_created_value_idx on data  (cost=0.56..801382.02 rows=318146 width=16) (actual time=0.021..0.190 rows=50 loops=1)
         Index Cond: ((value >= '0'::double precision) AND (value <= '10'::double precision))
         Heap Fetches: 0
 Planning Time: 0.084 ms
 Execution Time: 0.210 ms

#But with OFFSET 200000 the execution plan changes and the cost is much higher:
#as mentioned earlier, every row before the offset has to be scanned
postgres=# explain analyze SELECT value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created OFFSET 200000 LIMIT 50;
                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=434757.47..434763.31 rows=50 width=16) (actual time=2697.793..2704.289 rows=50 loops=1)
   ->  Gather Merge  (cost=411422.51..442355.57 rows=265122 width=16) (actual time=2627.028..2695.579 rows=200050 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=410422.49..410753.89 rows=132561 width=16) (actual time=2607.976..2613.950 rows=67369 loops=3)
               Sort Key: created
               Sort Method: external merge  Disk: 2760kB
               Worker 0:  Sort Method: external merge  Disk: 2640kB
               Worker 1:  Sort Method: external merge  Disk: 2640kB
               ->  Parallel Seq Scan on data  (cost=0.00..396876.00 rows=132561 width=16) (actual time=0.042..2551.663 rows=104958 loops=3)
                     Filter: ((value >= '0'::double precision) AND (value <= '10'::double precision))
                     Rows Removed by Filter: 10378242
 Planning Time: 0.102 ms
 Execution Time: 2704.851 ms
(14 rows)
As we saw above, the deeper the page, the worse the performance. The only advantage of this method is that it is simple to write.
Optimization method 1: Use cursors
Ordinary cursors only work within a single transaction, which limits their usefulness for pagination: keeping a transaction open while waiting for user interaction is a very bad idea. Long transactions not only hold table locks for a long time (which can block DDL or TRUNCATE statements), but also block autovacuum's cleanup, leading to table bloat.
The default is WITHOUT HOLD, which ties the cursor to the TRANSACTION level. WITH HOLD extends the cursor's lifetime to the SESSION level, at the cost of more server resources (memory or temporary files) to materialize and keep the result set.
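To make the difference concrete, here is a minimal sketch (the cursor name c2 is just for illustration):

#WITHOUT HOLD (the default): the cursor vanishes when the transaction ends
BEGIN;
DECLARE c2 CURSOR FOR SELECT id FROM data LIMIT 100;
COMMIT;
FETCH 10 FROM c2;   -- ERROR: cursor "c2" does not exist

#WITH HOLD: the result set is materialized at COMMIT and the cursor
#stays usable until CLOSE or the end of the session
BEGIN;
DECLARE c2 CURSOR WITH HOLD FOR SELECT id FROM data LIMIT 100;
COMMIT;
FETCH 10 FROM c2;   -- works
CLOSE c2;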
For a description of cursors, please refer to: /docs/13/
#Open a cursor over the result set that matches the conditions
BEGIN;
DECLARE c SCROLL CURSOR WITH HOLD FOR
    SELECT value, created
    FROM data
    WHERE value BETWEEN 0 AND 10
    ORDER BY created;
COMMIT;

#Move the cursor and fetch 50 rows -- same effect as OFFSET 200000 LIMIT 50
MOVE ABSOLUTE 200000 IN c;
FETCH 50 FROM c;

#When you are done with the cursor, remember to close it
CLOSE c;
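While the cursor is still open (i.e., before the CLOSE above), every page is just another FETCH, and MOVE gives random access in either direction. A small sketch of the idea:

#Sequential paging: each FETCH returns the next 50 rows
FETCH 50 FROM c;
FETCH 50 FROM c;

#Random access: jump to any position, then read forward or backward
MOVE ABSOLUTE 100000 IN c;
FETCH 50 FROM c;
FETCH BACKWARD 50 FROM c;   -- possible because the cursor was declared SCROLL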
Advantages and disadvantages of using cursors:
Advantages:
1. Cursors work equally well for every page; fetching the first page and the last page costs the same.
2. The cursor's result set is stable.
Disadvantages:
1. When you are done, do not forget to close the cursor; otherwise the result set stays on the server until the database session ends (see the check below this list).
2. If the cursor stays open for a long time, its data becomes stale and the latest concurrent changes are never seen.
3. Keeping a cursor open for a long time is equivalent to keeping a long transaction open, and I believe everyone agrees on the negative impact of long transactions.
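For the first point, PostgreSQL provides the pg_cursors system view, which lists the cursors currently open in the session; it is a quick way to verify nothing was left behind:

#Show cursors that are still open in the current session
SELECT name, statement, is_holdable, creation_time
FROM pg_cursors;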
Optimization method 2: Use a position marker (keyset pagination)
The principle of keyset pagination is very simple: record the last row of the previous page as a position marker, and use it as a filter condition in the next query. This removes the OFFSET entirely. Note that the sort key must include a pk or another unique column; if there is none, add one (as sketched below) so that the position marker is never ambiguous.
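If the table really has no primary key or unique column, a surrogate can be added. A hypothetical sketch (some_table and seq are placeholder names):

#Add an auto-generated column and a unique constraint to serve as the tiebreaker
ALTER TABLE some_table ADD COLUMN seq bigint GENERATED ALWAYS AS IDENTITY;
ALTER TABLE some_table ADD CONSTRAINT some_table_seq_key UNIQUE (seq);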
Examples are as follows:
#Remember the starting point of the next page from the current query;
#in this example id serves as the pk, guaranteeing uniqueness
SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
ORDER BY created, id
LIMIT 50;

  id  |        value        |        created
------+---------------------+------------------------
 .........omitted.........
 4568 |   7.771510504657186 | 2022-01-01 01:16:07+08
 4586 |  1.2500308700502671 | 2022-01-01 01:16:25+08
 4607 |  3.3096537558421346 | 2022-01-01 01:16:46+08

#Remember the created and id values of the last row on the page;
#then we can fetch the next page
SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
  AND (created, id) > ('2022-01-01 01:16:46+08', 4607)
ORDER BY created, id
LIMIT 50;

postgres=# explain SELECT id, value, created FROM data WHERE value BETWEEN 0 AND 10 AND (created, id) > ('2022-01-01 01:16:46+08', 4607) ORDER BY created, id LIMIT 50;
                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4.32..194.08 rows=50 width=24)
   ->  Incremental Sort  (cost=4.32..1207236.72 rows=318103 width=24)
         Sort Key: created, id
         Presorted Key: created
         ->  Index Scan using data_created_value_idx on data  (cost=0.56..1192922.08 rows=318103 width=24)
               Index Cond: ((created >= '2022-01-01 01:16:46+08'::timestamp with time zone) AND (value >= '0'::double precision) AND (value <= '10'::double precision))
               Filter: (ROW(created, id) > ROW('2022-01-01 01:16:46+08'::timestamp with time zone, 4607))
(7 rows)

#With a more suitable index, the execution plan gets even better
CREATE INDEX data_keyset_idx ON data (created, id, value);

postgres=# explain SELECT id, value, created FROM data WHERE value BETWEEN 0 AND 10 AND (created, id) > ('2022-01-01 01:16:46+08', 4607) ORDER BY created, id LIMIT 50;
                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..160.58 rows=50 width=24)
   ->  Index Only Scan using data_keyset_idx on data  (cost=0.56..1018064.43 rows=318103 width=24)
         Index Cond: ((ROW(created, id) > ROW('2022-01-01 01:16:46+08'::timestamp with time zone, 4607)) AND (value >= '0'::double precision) AND (value <= '10'::double precision))
(3 rows)
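One detail worth noting: the row-value comparison (created, id) > (...) compares lexicographically, which is exactly the "after this position" semantics we want, and it lets the planner use the index directly. Paging backward just flips the comparison and the sort direction. A sketch in the spirit of the example above (not from the original article):

#Previous page: flip the comparison and the sort order, then reverse the
#returned rows in the application before displaying them
SELECT id, value, created
FROM data
WHERE value BETWEEN 0 AND 10
  AND (created, id) < ('2022-01-01 01:16:46+08', 4607)
ORDER BY created DESC, id DESC
LIMIT 50;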
Advantages and disadvantages of keyset pagination:
Advantages:
1. Each query fetches only the rows we need and scans no extra data, which keeps resource costs low.
2. Each query sees the latest data, including rows modified by concurrent sessions.
Disadvantages:
1. It needs an index designed specifically for the query.
2. It only works when the exact position marker is known in advance; you cannot jump directly to an arbitrary page.
Reference:
/en/pagination-problem-total-result-count/#total-count
Summary
This concludes the article on common ways to optimize OFFSET ... LIMIT pagination in PostgreSQL. For more on PostgreSQL OFFSET ... LIMIT pagination optimization, please search my previous articles or browse the related articles below. I hope everyone will continue to support me in the future!