
A non-count way to get the number of records in PostgreSQL

The general method

select count(1) from table_name;

This scans the whole table, so the more records there are, the slower the query becomes.

A new method

PostgreSQL does provide such a way: the system catalog pg_class, which stores statistics for every table. Its reltuples column holds the estimated row count of the corresponding table. These statistics are maintained by an independent background process that periodically scans tables, collects their statistics, and stores them in the system catalog.

The method is as follows:

select 
 reltuples::int as total 
from 
 pg_class 
where 
 relname = 'table_name' 
 and relnamespace = (select oid from pg_namespace where nspname = 'schema');
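
To see how fresh those statistics are, you can check when the table was last analyzed. Below is a minimal sketch using the standard pg_stat_user_tables view; 'table_name' and 'schema' are the same placeholders as above.

select 
 relname, 
 last_analyze, 
 last_autoanalyze, 
 n_live_tup 
from 
 pg_stat_user_tables 
where 
 schemaname = 'schema' 
 and relname = 'table_name';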

The new method is not universally applicable. If you need an exact count, you still have to use select count(1). But for cases such as pagination, where the total only has to be roughly right once the page number gets large, it is a good approach.
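
If the estimate looks stale, running ANALYZE on the table refreshes reltuples and brings the estimate much closer to the real row count. A minimal sketch, again with table_name as a placeholder:

-- refresh the table's statistics, then read the estimate again
analyze table_name;

select reltuples::int as total from pg_class where relname = 'table_name';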

count(1) over() to calculate the record count

select count(1) over(), * from table_name;
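
This form is handy for pagination, because the total row count is returned alongside every row of the page, so no separate count query is needed. A minimal sketch against the same placeholder table (the id column used for ordering is also a placeholder):

-- every returned row carries the full row count in the total column
select 
 count(1) over() as total, 
 * 
from 
 table_name 
order by 
 id 
limit 20 offset 0;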

Supplementary notes

count is one of the most commonly used aggregate functions. It looks simple, but there are still pitfalls, for example:

1. count(*): returns the number of rows in the result set; rows containing NULL values are counted as well.

2. count(1): basically no different from count(*). Before PostgreSQL 9.2 it always meant a full table scan; since 9.2, an index-only scan on the primary key is generally used instead. If the table has no primary key, or has many columns, count(1) can be faster because the table's columns do not have to be considered.

3. count(field): returns the number of rows whose value in the specified field is not NULL (see the sketch after this list).
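
A quick way to see the difference between count(*), count(1) and count(field) is a tiny throwaway example with a NULL in it; this is just an illustration, not part of the test setup used later.

-- count(*) and count(1) count every row; count(v) skips the NULL
select count(*), count(1), count(v)
from (values (1), (2), (null)) as t(v);
-- result: 3, 3, 2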

Extension: understanding the behavior of PostgreSQL's count function

There has always been some controversy around how to use count, especially in MySQL. Does PostgreSQL, which keeps growing in popularity, have similar problems? Let's examine the behavior of count in PostgreSQL through some hands-on tests.

Build a test database

Create a test database and a test table. The test table has three fields: an auto-increment ID, a creation time, and a content column. The auto-increment ID is the primary key.

create database performance_test;

create table test_tbl (id serial primary key, created_at timestamp, content varchar(512));

Generate test data

Use the generate_series function to generate the auto-increment IDs and the now() function to fill the created_at column; for the content column, repeat(md5(random()::text), 10) concatenates ten 32-character md5 strings. The following statement inserts 10 million records for testing.

performance_test=# insert into test_tbl select generate_series(1,10000000), now(), repeat(md5(random()::text), 10);
INSERT 0 10000000
Time: 212184.223 ms (03:32.184)

Questions raised by count queries

By default, PostgreSQL does not display SQL execution time, so turn it on manually to make the comparisons below easier.

\timing on

The performance difference between count(*) and count(1) is a frequently discussed topic, so run one query with each.

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 115090.380 ms (01:55.090)
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 738.502 ms

The two queries differ enormously in speed. Does count(1) really bring that big a performance improvement? Run the two queries again.

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 657.831 ms
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 682.157 ms

The first query is very slow, while the following three are fast and take roughly the same time. This raises two questions:

Why is the first query so slow?

Is there any performance difference between count(*) and count(1)?

Query cache

Re-execute the query using the explain statement

explain (analyze,buffers,verbose) select count(*) from test_tbl;

You can see the following output:

Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=882.569..882.570 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=96 read=476095
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=882.492..884.170 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=96 read=476095
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=881.014..881.014 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=96 read=476095
        Worker 0: actual time=880.319..880.319 rows=1 loops=1
         Buffers: shared hit=34 read=158206
        Worker 1: actual time=880.369..880.369 rows=1 loops=1
         Buffers: shared hit=29 read=156424
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=0.029..662.165 rows=3333333 loops=3)
           Buffers: shared hit=96 read=476095
           Worker 0: actual time=0.026..661.807 rows=3323029 loops=1
            Buffers: shared hit=34 read=158206
           Worker 1: actual time=0.030..660.197 rows=3285513 loops=1
            Buffers: shared hit=29 read=156424
 Planning time: 0.043 ms
 Execution time: 884.207 ms

Note the shared hit figures: they show that data was found in the in-memory cache, which explains why the later queries were so much faster than the first. Next, drop the cache and restart PostgreSQL.

service postgresql stop
echo 1 > /proc/sys/vm/drop_caches
service postgresql start

Re-executing the SQL statement is now much slower.

 Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=50604.564..50604.564 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=476191
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=50604.508..50606.141 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared read=476191
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=50591.550..50591.551 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared read=476191
        Worker 0: actual time=50585.182..50585.182 rows=1 loops=1
         Buffers: shared read=158122
        Worker 1: actual time=50585.181..50585.181 rows=1 loops=1
         Buffers: shared read=161123
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=92.491..50369.691 rows=3333333 loops=3)
           Buffers: shared read=476191
           Worker 0: actual time=122.170..50362.271 rows=3320562 loops=1
            Buffers: shared read=158122
           Worker 1: actual time=14.020..50359.733 rows=3383583 loops=1
            Buffers: shared read=161123
 Planning time: 11.537 ms
 Execution time: 50606.215 ms

shared read means the cache was not hit. From this we can infer that, of the four queries in the previous section, the first did not hit the cache while the remaining three did.

The difference between count(1) and count(*)

Next, let's look at the difference between count(1) and count(*), going back to those first four queries. The first used count(*) and the second used count(1), yet the second still hit the cache, which suggests that count(1) and count(*) are treated as the same thing.

In fact, the official PostgreSQL answer to the question "Is there a difference performance-wise between select count(1) and select count(*)?" confirms this:

Nope. In fact, the latter is converted to the former during parsing.[2]

Since count(1) has no performance advantage over count(*), count(*) is the better choice.
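
One way to check this on your own instance is to compare the two execution plans; because one form is rewritten into the other during parsing, the plan shape and cost estimates should come out the same (a sketch; the output is omitted here since it depends on the instance):

explain (verbose) select count(*) from test_tbl;
explain (verbose) select count(1) from test_tbl;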

Sequential scan and index scan

Next, test the speed of count(*) at different data sizes. Write the count query into a file and run it with pgbench (the file name count.sql below is assumed; the original does not give one):

pgbench -c 5 -t 20 -r -f count.sql performance_test

Run the count query for data volumes from 2 million (200w) to 10 million (1000w) rows.
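
The query file itself is not shown in the original. Judging from the Index Cond lines in the explain output further down, it presumably contained a count with a range predicate on the primary key, with the upper bound changed for each run, roughly like this:

-- count.sql (assumed name): count the first N rows; N was varied from 2,000,000 up to 10,000,000
select count(*) from test_tbl where id < 2000000;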

Data size (rows)   count time (ms)
 2,000,000          738.758
 3,000,000         1035.846
 4,000,000         1426.183
 5,000,000         1799.866
 6,000,000         2117.247
 7,000,000         2514.691
 8,000,000         2526.441
 9,000,000         2568.240
10,000,000         2650.434

Plotting the timings as a curve makes the trend clear.

The trend of the curve changes between 6 million and 7 million rows: from 2 million to 6 million the time grows roughly linearly, and beyond that the count time stays roughly constant. Use explain to look at the count statement at 7 million and at 6 million rows.

700w:

Finalize Aggregate (cost=502185.93..502185.94 rows=1 width=8) (actual time=894.361..894.361 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=16344 read=352463
  -> Gather (cost=502185.72..502185.93 rows=2 width=8) (actual time=894.232..899.763 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=16344 read=352463
     -> Partial Aggregate (cost=501185.72..501185.73 rows=1 width=8) (actual time=889.371..889.371 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=16344 read=352463
        Worker 0: actual time=887.112..887.112 rows=1 loops=1
         Buffers: shared hit=5459 read=118070
        Worker 1: actual time=887.120..887.120 rows=1 loops=1
         Buffers: shared hit=5601 read=117051
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..493863.32 rows=2928960 width=0) (actual time=0.112..736.376 rows=2333333 loops=3)
           Index Cond: (test_tbl.id < 7000000)
           Heap Fetches: 2328492
           Buffers: shared hit=16344 read=352463
           Worker 0: actual time=0.107..737.180 rows=2344479 loops=1
            Buffers: shared hit=5459 read=118070
           Worker 1: actual time=0.133..737.960 rows=2327028 loops=1
            Buffers: shared hit=5601 read=117051
 Planning time: 0.165 ms
 Execution time: 899.857 ms

600w:

Finalize Aggregate (cost=429990.94..429990.95 rows=1 width=8) (actual time=765.575..765.575 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=13999 read=302112
  -> Gather (cost=429990.72..429990.93 rows=2 width=8) (actual time=765.557..770.889 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=13999 read=302112
     -> Partial Aggregate (cost=428990.72..428990.73 rows=1 width=8) (actual time=763.821..763.821 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=13999 read=302112
        Worker 0: actual time=762.742..762.742 rows=1 loops=1
         Buffers: shared hit=4638 read=98875
        Worker 1: actual time=763.308..763.308 rows=1 loops=1
         Buffers: shared hit=4696 read=101570
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..422723.16 rows=2507026 width=0) (actual time=0.053..632.199 rows=2000000 loops=3)
           Index Cond: (test_tbl.id < 6000000)
           Heap Fetches: 2018490
           Buffers: shared hit=13999 read=302112
           Worker 0: actual time=0.059..633.156 rows=1964483 loops=1
            Buffers: shared hit=4638 read=98875
           Worker 1: actual time=0.038..634.271 rows=2017026 loops=1
            Buffers: shared hit=4696 read=101570
 Planning time: 0.055 ms
 Execution time: 770.921 ms

Based on this behavior, PostgreSQL appears to use an index-only scan for count as long as the number of rows being counted is below a certain proportion of the table size. The official wiki describes the relevant behavior:

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.[3]

According to an answer cited in [4], once the rows being counted exceed roughly 3/4 of the table size, a full table scan is used instead of an index scan.
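
You can observe the switch directly by running explain with predicates that cover a small and a large fraction of the table; on a table like test_tbl the first would typically show an Index Only Scan and the second a (parallel) Seq Scan, although the exact threshold depends on the planner's cost estimates (a sketch; output not reproduced here):

-- well under 3/4 of the table: the planner tends to pick an index-only scan
explain select count(*) from test_tbl where id < 2000000;

-- close to the whole table: the planner tends to fall back to a sequential scan
explain select count(*) from test_tbl where id < 9500000;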

Conclusion

Do not use count(1) or count(column name) instead of count(*)

count itself is very time-consuming

count may use an index-only scan or a sequential scan, depending on the proportion of the counted rows to the table size

The above is based on my personal experience; I hope it can serve as a useful reference. If there are any mistakes or points not fully considered, corrections and suggestions are welcome.