PostgreSQL SQL SQL optimization solution for IN, EXISTS, ANY/ALL, JOIN

Test environment:

postgres=# select version();       
                         version                        
---------------------------------------------------------------------------------------------------------
 PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row) 
postgres=#

Data preparation:

$ pgbench -i -s 10

postgres=# \d
       List of relations
 Schema |    Name    | Type | Owner 
--------+------------------+-------+----------
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history | table | postgres
 public | pgbench_tellers | table | postgres
(4 rows)
 
postgres=# select * from pgbench_accounts limit 1;
 aid | bid | abalance |                    filler                    
-----+-----+----------+--------------------------------------------------------------------------------------
  1 |  1 |    0 |                                          
(1 row)
 
postgres=# select * from pgbench_branches limit 1;
 bid | bbalance | filler
-----+----------+--------
  1 |    0 |
(1 row)
 
postgres=# select * from pgbench_history limit 1;
 tid | bid | aid | delta | mtime | filler
-----+-----+-----+-------+-------+--------
(0 rows)
 
postgres=# select * from pgbench_tellers limit 1;
 tid | bid | tbalance | filler
-----+-----+----------+--------
  1 |  1 |    0 |
(1 row)
 
postgres=# select * from pgbench_branches;
 bid | bbalance | filler
-----+----------+--------
  1 |    0 |
  2 |    0 |
  3 |    0 |
  4 |    0 |
  5 |    0 |
  6 |    0 |
  7 |    0 |
  8 |    0 |
  9 |    0 |
 10 |    0 |
(10 rows)
 
postgres=# update pgbench_branches set bbalance=4500000 where bid in (4,7);
UPDATE 2
postgres=#

IN statement

Query requirements: Find out how many accounts each branch with balance greater than 0 has in the table in pgbench_accounts

1.Use the IN clause

SELECT
  count( aid ),bid
FROM
  pgbench_accounts
WHERE
  bid IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )
GROUP BY
  bid;

2. Use the ANY clause

SELECT
  count( aid ),bid
FROM
  pgbench_accounts
WHERE
  bid = ANY ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )
GROUP BY
  bid;

3. Use the EXISTS clause

SELECT
  count( aid ),bid
FROM
  pgbench_accounts
WHERE
  EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )
GROUP BY
  bid;

4. Use INNER JOIN

SELECT
  count( aid ),
FROM
  pgbench_accounts a
  JOIN pgbench_branches b ON  = 
WHERE
   > 0
GROUP BY
  ;

When completing this query requirement, one might assume that exists and inner join performance may be better, as they can use the logic and optimization of two table joins. The IN and ANY clauses require subquery.

However, PostgreSQL (after version 10) is already smart enough to produce the same execution plan for the above four writing methods!

All the above writing methods will produce the same execution plan:

                                      QUERY PLAN                                      
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate (cost=23327.73..23330.26 rows=10 width=12) (actual time=97.199..99.014 rows=2 loops=1)
  Group Key: 
  -> Gather Merge (cost=23327.73..23330.06 rows=20 width=12) (actual time=97.191..99.006 rows=6 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (cost=22327.70..22327.73 rows=10 width=12) (actual time=93.762..93.766 rows=2 loops=3)
        Sort Key: 
        Sort Method: quicksort Memory: 25kB
        Worker 0: Sort Method: quicksort Memory: 25kB
        Worker 1: Sort Method: quicksort Memory: 25kB
        -> Partial HashAggregate (cost=22327.44..22327.54 rows=10 width=12) (actual time=93.723..93.727 rows=2 loops=3)
           Group Key: 
           -> Hash Join (cost=1.14..22119.10 rows=41667 width=8) (actual time=24.024..83.263 rows=66667 loops=3)
              Hash Cond: ( = )
              -> Parallel Seq Scan on pgbench_accounts a (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.023..43.151 rows=333333 loops=3)
              -> Hash (cost=1.12..1.12 rows=1 width=4) (actual time=0.027..0.028 rows=2 loops=3)
                 Buckets: 1024 Batches: 1 Memory Usage: 9kB
                 -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows=1 width=4) (actual time=0.018..0.020 rows=2 loops=3)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
 Planning Time: 0.342 ms
 Execution Time: 99.164 ms
(22 rows)

So, can we conclude that we can write queries at will, and PostgreSQL's intelligence will handle the rest of the problem?!

etc!

If we consider exclusion, things will become different.

Exclude query

Query requirements: Find out how many accounts each branch with balance is not greater than 0 in the table in pgbench_accounts

1. Use NOT IN

SELECT
  count( aid ),bid
FROM
  pgbench_accounts
WHERE
  bid NOT IN ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )
GROUP BY
  bid;

Implementation plan:

                                    QUERY PLAN                                    
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate (cost=23645.42..23647.95 rows=10 width=12) (actual time=128.606..130.502 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  -> Gather Merge (cost=23645.42..23647.75 rows=20 width=12) (actual time=128.598..130.490 rows=24 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (cost=22645.39..22645.42 rows=10 width=12) (actual time=124.960..124.963 rows=8 loops=3)
        Sort Key: pgbench_accounts.bid
        Sort Method: quicksort Memory: 25kB
        Worker 0: Sort Method: quicksort Memory: 25kB
        Worker 1: Sort Method: quicksort Memory: 25kB
        -> Partial HashAggregate (cost=22645.13..22645.23 rows=10 width=12) (actual time=124.917..124.920 rows=8 loops=3)
           Group Key: pgbench_accounts.bid
           -> Parallel Seq Scan on pgbench_accounts (cost=1.13..21603.46 rows=208333 width=8) (actual time=0.078..83.134 rows=266667 loops=3)
              Filter: (NOT (hashed SubPlan 1))
              Rows Removed by Filter: 66667
              SubPlan 1
               -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=1 width=4) (actual time=0.020..0.021 rows=2 loops=3)
                  Filter: (bbalance > 0)
                  Rows Removed by Filter: 8
 Planning Time: 0.310 ms
 Execution Time: 130.620 ms
(21 rows)
 
postgres=#

2. Use <>ALL

SELECT
  count( aid ),bid
FROM
  pgbench_accounts
WHERE
  bid <> ALL ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 )
GROUP BY
  bid;

Implementation plan:

                                     QUERY PLAN                                    
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate (cost=259581.79..259584.32 rows=10 width=12) (actual time=418.220..419.913 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  -> Gather Merge (cost=259581.79..259584.12 rows=20 width=12) (actual time=418.212..419.902 rows=24 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (cost=258581.76..258581.79 rows=10 width=12) (actual time=413.906..413.909 rows=8 loops=3)
        Sort Key: pgbench_accounts.bid
        Sort Method: quicksort Memory: 25kB
        Worker 0: Sort Method: quicksort Memory: 25kB
        Worker 1: Sort Method: quicksort Memory: 25kB
        -> Partial HashAggregate (cost=258581.50..258581.60 rows=10 width=12) (actual time=413.872..413.875 rows=8 loops=3)
           Group Key: pgbench_accounts.bid
           -> Parallel Seq Scan on pgbench_accounts (cost=0.00..257539.83 rows=208333 width=8) (actual time=0.054..367.244 rows=266667 loops=3)
              Filter: (SubPlan 1)
              Rows Removed by Filter: 66667
              SubPlan 1
               -> Materialize (cost=0.00..1.13 rows=1 width=4) (actual time=0.000..0.001 rows=2 loops=1000000)
                  -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=1 width=4) (actual time=0.001..0.001 rows=2 loops=337880)
                     Filter: (bbalance > 0)
                     Rows Removed by Filter: 8
 Planning Time: 0.218 ms
 Execution Time: 420.035 ms
(22 rows) 
postgres=#

3. Use NOT EXISTS

SELECT
  count( aid ),bid
FROM
  pgbench_accounts
WHERE
  NOT EXISTS ( SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid )
GROUP BY
  bid;

Implementation plan:

                                      QUERY PLAN                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate (cost=28327.72..28330.25 rows=10 width=12) (actual time=152.024..153.931 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  -> Gather Merge (cost=28327.72..28330.05 rows=20 width=12) (actual time=152.014..153.917 rows=24 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (cost=27327.70..27327.72 rows=10 width=12) (actual time=147.782..147.786 rows=8 loops=3)
        Sort Key: pgbench_accounts.bid
        Sort Method: quicksort Memory: 25kB
        Worker 0: Sort Method: quicksort Memory: 25kB
        Worker 1: Sort Method: quicksort Memory: 25kB
        -> Partial HashAggregate (cost=27327.43..27327.53 rows=10 width=12) (actual time=147.732..147.737 rows=8 loops=3)
           Group Key: pgbench_accounts.bid
           -> Hash Anti Join (cost=1.14..25452.43 rows=375000 width=8) (actual time=0.134..101.884 rows=266667 loops=3)
              Hash Cond: (pgbench_accounts.bid = pgbench_branches.bid)
              -> Parallel Seq Scan on pgbench_accounts (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.032..45.174 rows=333333 loops=3)
              -> Hash (cost=1.12..1.12 rows=1 width=4) (actual time=0.036..0.037 rows=2 loops=3)
                 Buckets: 1024 Batches: 1 Memory Usage: 9kB
                 -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=1 width=4) (actual time=0.025..0.027 rows=2 loops=3)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
 Planning Time: 0.322 ms
 Execution Time: 154.040 ms
(22 rows) 
postgres=#

4. Use LEFT JOIN and IS NULL

SELECT
  count( aid ),
FROM
  pgbench_accounts a
  LEFT JOIN pgbench_branches b ON  =  AND  > 0
WHERE
   IS NULL
GROUP BY
  ;

Implementation plan:

                                      QUERY PLAN                                      
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate (cost=28327.72..28330.25 rows=10 width=12) (actual time=145.298..147.096 rows=8 loops=1)
  Group Key: 
  -> Gather Merge (cost=28327.72..28330.05 rows=20 width=12) (actual time=145.288..147.083 rows=24 loops=1)
     Workers Planned: 2
     Workers Launched: 2
     -> Sort (cost=27327.70..27327.72 rows=10 width=12) (actual time=141.883..141.887 rows=8 loops=3)
        Sort Key: 
        Sort Method: quicksort Memory: 25kB
        Worker 0: Sort Method: quicksort Memory: 25kB
        Worker 1: Sort Method: quicksort Memory: 25kB
        -> Partial HashAggregate (cost=27327.43..27327.53 rows=10 width=12) (actual time=141.842..141.847 rows=8 loops=3)
           Group Key: 
           -> Hash Anti Join (cost=1.14..25452.43 rows=375000 width=8) (actual time=0.087..99.535 rows=266667 loops=3)
              Hash Cond: ( = )
              -> Parallel Seq Scan on pgbench_accounts a (cost=0.00..20560.67 rows=416667 width=8) (actual time=0.025..44.337 rows=333333 loops=3)
              -> Hash (cost=1.12..1.12 rows=1 width=4) (actual time=0.026..0.027 rows=2 loops=3)
                 Buckets: 1024 Batches: 1 Memory Usage: 9kB
                 -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows=1 width=4) (actual time=0.019..0.020 rows=2 loops=3)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
 Planning Time: 0.231 ms
 Execution Time: 147.180 ms
(22 rows) 
postgres=#

Both NOT IN and <> ALL generation execution plan contain a subquery. They are independent.

NOT EXISTS and LEFT JOIN generate the same execution plan.

These hash connections (or hash anti joins) are the most flexible way to complete query requirements. This is also the reason why exists or joins are recommended. Therefore, the rule of thumb that recommends using exists or join is effective.

However, let's continue to look down! Even with the subquery execution plan, will the execution time of the NOT IN clause be better?

Yes. PostgreSQL has done excellent optimizations, and PostgreSQL hash the subquery plan. Therefore PostgreSQL has a better understanding of how to handle IN clauses, which is a logical way of thinking because many people tend to use IN clauses. The subquery returns few rows, but the same happens even if the subquery returns hundreds of rows.

But what if the subquery returns a large number of rows (hundreds of thousands of rows)? Let's try a simple test:

CREATE TABLE t1 AS
SELECT * FROM generate_series(0, 500000) id;
 
CREATE TABLE t2 AS
SELECT (random() * 4000000)::integer id
FROM generate_series(0, 4000000);
 
ANALYZE t1;
ANALYZE t2;
 
EXPLAIN SELECT id
FROM t1
WHERE id NOT IN (SELECT id FROM t2);

Implementation plan:

    QUERY PLAN                 
--------------------------------------------------------------------------------
 Gather (cost=1000.00..15195064853.01 rows=250000 width=4)
  Workers Planned: 1
  -> Parallel Seq Scan on t1 (cost=0.00..15195038853.01 rows=147059 width=4)
     Filter: (NOT (SubPlan 1))
     SubPlan 1
      -> Materialize (cost=0.00..93326.01 rows=4000001 width=4)
         -> Seq Scan on t2 (cost=0.00..57700.01 rows=4000001 width=4)
(7 rows)
 
postgres=#

Here, the execution plan materializes the subquery. The cost assessment became 15195038853.01. (The default setting of PostgreSQL, if the rows of the t2 table are less than 100k, the subquery will be hashed). This will seriously affect performance. Therefore, for scenarios where the number of rows returned by the subquery is small, the IN clause can play a good role.

Other points of attention

some! When we write queries in different ways, there may be conversions of data types.

For example, statement:

EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY(ARRAY['M', 'F']);

Implicit type conversion occurs:

Seq Scan on emp (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)
 Filter: ((gen)::text = ANY ('{M,F}'::text[]))

Here (gen)::text type conversion occurs. This type conversion will be expensive if it is on large tables, so PostgreSQL handles the IN clause better.

EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');
 
 Seq Scan on emp (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)
  Filter: (gen = ANY ('{M,F}'::bpchar[]))

The IN clause was converted to an ANY clause, and no gen column was type converted. Instead, M\F is converted to bpchar (internal equivalent to char)

Summarize

Simply put, exists and direct join tables are usually better.

In many cases, PostgreSQL replaces the IN clause with a subplan that is hashed. In some special scenarios, IN can obtain a better execution plan.

The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.