1. Scalar subquery and filter
A subquery that appears between SELECT and FROM is a scalar subquery. In practice, many people write piles of scalar-subquery SQL out of convenience. When the tables are small this usually does no visible harm, but on large data volumes it often hurts performance badly.
This is because a scalar subquery behaves like a nested loop join whose driving table is fixed as the main (outer) table. As shown below:
bill=# explain select empno,ename,sal,deptno,
bill-# (select dname from dept d where d.deptno = e.deptno) as dname
bill-# from emp e;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on emp e  (cost=0.00..15.84 rows=14 width=64)
   SubPlan 1
     ->  Seq Scan on dept d  (cost=0.00..1.05 rows=1 width=9)
           Filter: (deptno = e.deptno)
(4 rows)
With the above SQL, every time the emp table emits a row, the dept table must be scanned in full.
We all know that in a nested loop the join column of the driven (inner) table should be indexed; likewise, the join column of the table inside a scalar subquery should be indexed. Even so, we should avoid scalar subqueries when writing SQL: if the main table returns many rows, the subquery's table must be traversed that many times, which has a huge impact on performance.
So how do we optimize SQL containing a scalar subquery? The most common technique is to rewrite it as an outer join, which lets the PostgreSQL optimizer choose the join method according to the actual conditions. Note that a scalar subquery cannot, in general, be changed into an inner join. As the example above shows, a scalar subquery is a value-passing process: the main table passes a value to the subquery's table, and if there is no matching row, NULL is returned. An inner join would lose those rows.
Therefore, the above scalar subquery can be rewritten as:

bill=# explain select e.empno,e.ename,e.sal,e.deptno,d.dname
bill-# from emp e
bill-# left join dept d on (e.deptno = d.deptno);
                            QUERY PLAN
-------------------------------------------------------------------
 Hash Left Join  (cost=1.09..2.31 rows=14 width=27)
   Hash Cond: (e.deptno = d.deptno)
   ->  Seq Scan on emp e  (cost=0.00..1.14 rows=14 width=18)
   ->  Hash  (cost=1.04..1.04 rows=4 width=13)
         ->  Seq Scan on dept d  (cost=0.00..1.04 rows=4 width=13)
(5 rows)

As you can see, the optimizer has now chosen the more suitable hash join based on the actual situation.
When the main table's join column is a foreign key referencing the child table's primary key, an inner join can also be used, because a foreign key column declared NOT NULL is guaranteed to match a parent row, so no rows can be lost.
bill=# explain select e.empno,e.ename,e.sal,e.deptno,d.dname
bill-# from emp e
bill-# inner join dept d on (e.deptno = d.deptno);
                            QUERY PLAN
-------------------------------------------------------------------
 Hash Join  (cost=1.09..2.31 rows=14 width=27)
   Hash Cond: (e.deptno = d.deptno)
   ->  Seq Scan on emp e  (cost=0.00..1.14 rows=14 width=18)
   ->  Hash  (cost=1.04..1.04 rows=4 width=13)
         ->  Seq Scan on dept d  (cost=0.00..1.04 rows=4 width=13)
(5 rows)
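The NULL-preservation argument above can be checked with a small script. This is an illustrative sketch only, using Python's sqlite3 in place of PostgreSQL (the semantics shown are standard SQL); the sample rows, including the NEWHIRE employee with no department, are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table dept (deptno int primary key, dname text)")
cur.execute("create table emp (empno int, ename text, sal int, deptno int)")
cur.executemany("insert into dept values (?,?)",
                [(10, "ACCOUNTING"), (20, "RESEARCH")])
cur.executemany("insert into emp values (?,?,?,?)",
                [(7839, "KING", 5000, 10),
                 (7369, "SMITH", 800, 20),
                 (9999, "NEWHIRE", 1000, None)])  # not yet assigned a dept

# Scalar subquery: one dept lookup per emp row; misses become NULL.
scalar = cur.execute(
    "select ename, (select dname from dept d where d.deptno = e.deptno) "
    "from emp e order by ename").fetchall()

# Equivalent left join: same rows, NULL dname preserved for NEWHIRE.
left = cur.execute(
    "select e.ename, d.dname from emp e "
    "left join dept d on e.deptno = d.deptno order by e.ename").fetchall()

# Inner join silently drops the unmatched row -- not equivalent.
inner = cur.execute(
    "select e.ename, d.dname from emp e "
    "join dept d on e.deptno = d.deptno order by e.ename").fetchall()

print(scalar == left)           # True: the left-join rewrite is safe
print(len(scalar), len(inner))  # 3 2: the inner join lost a row
```

The left join reproduces the scalar subquery exactly; the inner join only matches when the join column is guaranteed non-null, as the article notes.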
Besides scalar subqueries, Filter nodes often produce similar situations: when a filter contains join conditions, the driving table is likewise fixed, so the optimizer may settle on an inefficient execution plan. PostgreSQL itself does not support hints, so once a bad plan is locked in this way, rewriting the SQL is often the only remedy.
Let us look at filters more closely. Filters appear in two main forms in PostgreSQL. One is the ordinary WHERE-clause filter on a single table, which generally causes no performance problems, such as:
bill=# explain select * from t where id < 10;
                       QUERY PLAN
-------------------------------------------------------
 Seq Scan on t  (cost=0.00..16925.00 rows=100 width=4)
   Filter: (id < 10)
(2 rows)
The other form is a Filter that contains table join conditions. This is the case mentioned earlier that deserves attention, for example:
bill=# explain select exists (select 1 from t where t.id = n.id) from n;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on n  (cost=0.00..169250145.00 rows=10000 width=1)
   SubPlan 1
     ->  Seq Scan on t  (cost=0.00..16925.00 rows=1 width=0)
           Filter: (id = n.id)
(4 rows)
So which ways of writing SQL tend to produce such filters? In PostgreSQL, using exists or not exists, or putting keywords that solidify a subquery into it, such as union, union all, cube, rollup, limit, and so on, makes the execution plan prone to filters.
Therefore, we rewrite the SQL above using in in place of exists:
bill=# explain select id in (select id from t) from n;
                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on n  (cost=0.00..129160170.00 rows=10000 width=1)
   SubPlan 1
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)
           ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)
Furthermore, in PostgreSQL we recommend rewriting this class of SQL with the = any(array(...)) form:
bill=# explain select id = any(array(select id from t)) from n;
                            QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on n  (cost=14425.00..14695.00 rows=10000 width=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)
Of course, this does not mean the in form is always better than exists; it is simply less likely to produce a filter. Why? If a subquery contains one of the solidifying keywords mentioned above, it is frozen into a single unit. With the exists form, the subquery references a join column of the main table, so the main table can only pass values into the subquery through that column, row by row, and a Filter is chosen. With the in form, even when the subquery is solidified, it references no column of the main table, so no Filter is needed.
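As a sanity check that the exists and in forms return the same answer, here is a small sketch with sqlite3 standing in for PostgreSQL; the table contents are made up. One assumption worth stating: the equivalence holds when the subquery column contains no NULLs, since in and exists treat NULL differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table t (id int)")
cur.execute("create table n (id int)")
# Only even ids exist in t, so half the probes from n will miss.
cur.executemany("insert into t values (?)", [(i,) for i in range(0, 100, 2)])
cur.executemany("insert into n values (?)", [(i,) for i in range(10)])

exists_form = cur.execute(
    "select n.id, exists(select 1 from t where t.id = n.id) "
    "from n order by n.id").fetchall()
in_form = cur.execute(
    "select n.id, n.id in (select id from t) "
    "from n order by n.id").fetchall()

print(exists_form == in_form)  # True: same results, but plans can differ
```

Both forms return identical rows here; the point of the rewrite is purely about which execution plan the optimizer can choose, not about the result.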
2. View merging
I wonder if you have encountered any situations like the following:
select xxx from (...) t1, (...) t2 where t1.xxx = t2.xxx;
The two subqueries t1 and t2 each execute very quickly on their own, but joined together the query becomes especially slow. This situation is often caused by view merging.
For example, the following SQL:

bill=# explain select a.*,c.grade
bill-# from (select ename,sal,a.deptno,b.dname
bill(# from emp a,dept b
bill(# where a.deptno = b.deptno) a,
bill-# salgrade c
bill-# where a.sal between c.losal and c.hisal;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=1.09..4.56 rows=8 width=27)
   Hash Cond: (a.deptno = b.deptno)
   ->  Nested Loop  (cost=0.00..3.43 rows=8 width=18)
         Join Filter: ((a.sal >= c.losal) AND (a.sal <= c.hisal))
         ->  Seq Scan on emp a  (cost=0.00..1.14 rows=14 width=14)
         ->  Materialize  (cost=0.00..1.07 rows=5 width=12)
               ->  Seq Scan on salgrade c  (cost=0.00..1.05 rows=5 width=12)
   ->  Hash  (cost=1.04..1.04 rows=4 width=13)
         ->  Seq Scan on dept b  (cost=0.00..1.04 rows=4 width=13)
(9 rows)

Going by the order written in the SQL, emp and dept should be joined first, and salgrade joined afterwards. In the execution plan, however, emp is joined with salgrade first, and dept is joined last. This indicates that view merging has occurred: the contents of the view/subquery were taken apart and merged into the outer query.
As the example shows, the performance problems caused by view merging generally stem from the change in join order after the merge. Normally the optimizer merges views precisely to help us choose a better join order; only when it picks the wrong order do we need to rewrite the SQL ourselves.
Since PostgreSQL has no hints with which to forbid view merging, we need to know some SQL rewriting techniques.
As with the filter case earlier, once the view/subquery is solidified, it can no longer be merged. We can therefore rewrite the SQL above as:

bill=# explain select a.*,c.grade
bill-# from (select ename,sal,a.deptno,b.dname
bill(# from emp a,dept b
bill(# where a.deptno = b.deptno group by ename,sal,a.deptno,b.dname) a,
bill-# salgrade c
bill-# where a.sal between c.losal and c.hisal;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Nested Loop  (cost=2.45..5.02 rows=8 width=27)
   Join Filter: ((a.sal >= c.losal) AND (a.sal <= c.hisal))
   ->  HashAggregate  (cost=2.45..2.59 rows=14 width=23)
         Group Key: a.ename, a.sal, a.deptno, b.dname
         ->  Hash Join  (cost=1.09..2.31 rows=14 width=23)
               Hash Cond: (a.deptno = b.deptno)
               ->  Seq Scan on emp a  (cost=0.00..1.14 rows=14 width=14)
               ->  Hash  (cost=1.04..1.04 rows=4 width=13)
                     ->  Seq Scan on dept b  (cost=0.00..1.04 rows=4 width=13)
   ->  Materialize  (cost=0.00..1.07 rows=5 width=12)
         ->  Seq Scan on salgrade c  (cost=0.00..1.05 rows=5 width=12)
(11 rows)

After adding group by, the subquery is solidified, the view is no longer merged, and emp and dept are joined first.
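One caveat with this group by fence: grouping by all output columns also deduplicates rows, so the rewrite preserves results only when the subquery's rows are already distinct. A quick sketch, with sqlite3 standing in for PostgreSQL and a deliberately duplicated sample row, shows the hazard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table emp (ename text, sal int, deptno int)")
cur.execute("create table dept (deptno int, dname text)")
cur.executemany("insert into dept values (?,?)",
                [(10, "ACCOUNTING"), (20, "RESEARCH")])
# Two identical SMITH rows, to expose what the group by fence does.
cur.executemany("insert into emp values (?,?,?)",
                [("KING", 5000, 10), ("SMITH", 800, 20), ("SMITH", 800, 20)])

plain = cur.execute(
    "select ename,sal,a.deptno,b.dname from emp a, dept b "
    "where a.deptno = b.deptno order by ename").fetchall()
fenced = cur.execute(
    "select ename,sal,a.deptno,b.dname from emp a, dept b "
    "where a.deptno = b.deptno "
    "group by ename,sal,a.deptno,b.dname order by ename").fetchall()

print(len(plain), len(fenced))  # 3 2: the fence silently deduplicated a row
```

In the article's example each employee row is unique across the selected columns, so the rewrite is safe there; in general, verify distinctness before applying it.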
3. Predicate pushdown
Having covered view merging, let's look at what happens when a view cannot be merged: predicate pushdown. For views that cannot be merged but have predicates filtering them, the CBO pushes the filter conditions down into the view, in order to weed out useless data as early as possible and thereby improve performance.
From the CBO's perspective, predicate pushdown is naturally a good thing, since unwanted data is filtered out in advance. But if the pushed predicate involves a join column, it may change the join order of the tables and make SQL performance worse.
In the SQL below, the outer predicate sal between losal and hisal is pushed down into the view:

bill=# create or replace view v1 as select ename,sal,a.deptno,b.dname
bill-# from emp a,dept b
bill-# where a.deptno = b.deptno;
CREATE VIEW
bill=# explain select d.*,c.grade from v1 d,salgrade c
bill-# where d.sal between c.losal and c.hisal;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=1.09..4.56 rows=8 width=27)
   Hash Cond: (a.deptno = b.deptno)
   ->  Nested Loop  (cost=0.00..3.43 rows=8 width=18)
         Join Filter: ((a.sal >= c.losal) AND (a.sal <= c.hisal))
         ->  Seq Scan on emp a  (cost=0.00..1.14 rows=14 width=14)
         ->  Materialize  (cost=0.00..1.07 rows=5 width=12)
               ->  Seq Scan on salgrade c  (cost=0.00..1.05 rows=5 width=12)
   ->  Hash  (cost=1.04..1.04 rows=4 width=13)
         ->  Seq Scan on dept b  (cost=0.00..1.04 rows=4 width=13)
(9 rows)
So how can we prevent predicate pushdown? In Oracle, pushdown of join predicates can be disabled through the hidden parameter _push_join_predicate. How should it be done in PostgreSQL?
As above, we can avoid it by solidifying the view, but this is generally not recommended, because in most cases pushing predicates into the view is good for performance. For example, using limit in a view prevents predicates from being pushed in, which is one reason limit inside views is generally discouraged. Why can't they be pushed? Because if a predicate were applied below the limit, the rows selected by the limit could change, altering the result set; and any equivalent transformation performed by the optimizer must not change the SQL result.
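The limit argument can be made concrete: applying a filter before versus after a limit gives different answers, which is exactly why the optimizer may not push a predicate past a limit. A sketch using sqlite3 in place of PostgreSQL; the table t and its contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table t (id int)")
cur.executemany("insert into t values (?)", [(i,) for i in range(100)])

# Predicate applied inside the subquery, before the limit:
pushed = cur.execute(
    "select * from (select id from t where id >= 50 order by id limit 5) s"
).fetchall()

# Predicate applied outside, after the limit has already taken 5 rows:
not_pushed = cur.execute(
    "select * from (select id from t order by id limit 5) s where id >= 50"
).fetchall()

print(pushed)      # [(50,), (51,), (52,), (53,), (54,)]
print(not_pushed)  # []: limit ran first, then the filter -- different result
```

Since the two orderings are not equivalent, the optimizer must keep the predicate outside the limited subquery, exactly as described above.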
This concludes this article introducing common PostgreSQL optimization techniques. I hope it proves useful.