In the database, you often encounter some table columns with sparse columns and only a few values, such as gender fields, and generally there are only 2 different values.
But when we find the unique values of these sparse columns, if the table's data volume is large, the speed will still be very slow.
For example:
Create a test table
bill=# create table t_sex (sex char(1), otherinfo text); CREATE TABLE bill=# insert into t_sex select 'm', generate_series(1,10000000)||'this is test'; INSERT 0 10000000 bill=# insert into t_sex select 'w', generate_series(1,10000000)||'this is test'; INSERT 0 10000000
Query:
You can see that the query below is very slow.
bill=# select count(distinct sex) from t_sex; count ------- 2 (1 row) Time: 8803.505 ms (00:08.804) bill=# select sex from t_sex t group by sex; sex ----- m w (2 rows) Time: 1026.464 ms (00:01.026)
So what happens when we add an index to this field?
The speed is still not obvious
bill=# create index idx_sex_1 on t_sex(sex); CREATE INDEX bill=# select count(distinct sex) from t_sex; count ------- 2 (1 row) Time: 8502.460 ms (00:08.502) bill=# select sex from t_sex t group by sex; sex ----- m w (2 rows) Time: 572.353 ms
The change can be seen that the execution plan has already used Index Only Scan.
bill=# explain select count(distinct sex) from t_sex; QUERY PLAN ---------------------------------------------------------------------------------------------- Aggregate (cost=371996.44..371996.45 rows=1 width=8) -> Index Only Scan using idx_sex_1 on t_sex (cost=0.44..321996.44 rows=20000000 width=2) (2 rows)
Let's see how the same SQL performs in Oracle?
Create a test table:
SQL> create table t_sex (sex char(1), otherinfo varchar2(100)); Table created. SQL> insert into t_sex select 'm', rownum||'this is test' from dual connect by level <=10000000; 10000000 rows created. SQL> commit; Commit complete. SQL> insert into t_sex select 'w', rownum||'this is test' from dual connect by level <=10000000; 10000000 rows created. SQL> commit; Commit complete.
Performance Test:
SQL> set lines 1000 pages 2000 SQL> set autotrace on SQL> set timing on SQL> select count(distinct sex) from t_sex; COUNT(DISTINCTSEX) ------------------ 2 Elapsed: 00:00:01.58 Execution Plan ---------------------------------------------------------- Plan hash value: 3915432945 ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 3 | 20132 (1)| 00:00:01 | | 1 | SORT GROUP BY | | 1 | 3 | | | | 2 | TABLE ACCESS FULL| T_SEX | 14M| 42M| 20132 (1)| 00:00:01 | ---------------------------------------------------------------------------- Note ----- - dynamic statistics used: dynamic sampling (level=2) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 74074 consistent gets 0 physical reads 0 redo size 552 bytes sent via SQL*Net to client 608 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 1 rows processed SQL> select sex from t_sex t group by sex; SE -- m w Elapsed: 00:00:01.08 Execution Plan ---------------------------------------------------------- Plan hash value: 3915432945 ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 14M| 42M| 20558 (3)| 00:00:01 | | 1 | SORT GROUP BY | | 14M| 42M| 20558 (3)| 00:00:01 | | 2 | TABLE ACCESS FULL| T_SEX | 14M| 42M| 20132 (1)| 00:00:01 | ---------------------------------------------------------------------------- Note ----- - dynamic statistics used: dynamic sampling (level=2) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 74074 consistent gets 0 physical reads 0 redo size 589 bytes sent via SQL*Net to client 608 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 2 rows processed
You can see that Oracle's performance is obviously better than in PostgreSQL even without indexing.
So is there no way for us to continue to optimize in PostgreSQL? In this case, we can greatly improve performance by using recursive statements in pg combined with indexes.
SQL rewrite:
bill=# with recursive tmp as ( bill(# ( bill(# select min() as sex from t_sex t where is not null bill(# ) bill(# union all bill(# ( bill(# select (select min() from t_sex t where > and is not null) bill(# from tmp s where is not null bill(# ) bill(# ) bill-# select count(distinct sex) from tmp; count ------- 2 (1 row) Time: 2.711 ms
View execution plan:
bill=# explain with recursive tmp as ( bill(# ( bill(# select min() as sex from t_sex t where is not null bill(# ) bill(# union all bill(# ( bill(# select (select min() from t_sex t where > and is not null) bill(# from tmp s where is not null bill(# ) bill(# ) bill-# select count(distinct sex) from tmp; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Aggregate (cost=53.62..53.63 rows=1 width=8) CTE tmp -> Recursive Union (cost=0.46..51.35 rows=101 width=32) -> Result (cost=0.46..0.47 rows=1 width=32) InitPlan 3 (returns $1) -> Limit (cost=0.44..0.46 rows=1 width=2) -> Index Only Scan using idx_sex_1 on t_sex t (cost=0.44..371996.44 rows=20000000 width=2) Index Cond: (sex IS NOT NULL) -> WorkTable Scan on tmp s (cost=0.00..4.89 rows=10 width=32) Filter: (sex IS NOT NULL) -> CTE Scan on tmp (cost=0.00..2.02 rows=101 width=32) (11 rows) Time: 1.371 ms
It can be seen that the execution time has been reduced from the original 8000ms to 2ms, which has been increased by thousands of times!
Even compared to Oracle, the performance has been improved a lot.
But it is important to note:This writing method is only for sparse columns. If it is replaced with fields with widely distributed data, the performance is obviously degraded. Therefore, using recursive SQL is not suitable for group by or count (distinct) operations of fields with widely distributed data.
This is the end of this article about PostgreSQL using recursive optimization to find unique values for sparse columns. For more related PostgreSQL recursive optimization content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!