Like query optimization solution in postgresql

When the quantity was relatively large, the efficiency of fuzzy query was very slow. In order to optimize the query efficiency, try the following methods to compare the efficiency.

1. Comparative situation description:

1. Data volume 100w pieces of data

2. Execute sql

2. Comparison results

explain analyze SELECT
 c_patent,
 c_applyissno,
 d_applyissdate,
 d_applydate,
 c_patenttype_dimn,
 c_newlawstatus,
 c_abstract 
FROM
 public.t_knowl_patent_zlxx_temp 
WHERE
 c_applicant LIKE '%Anping An Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County%';

1. Implementation plan when no search is built:

"Gather (cost=1000.00..83803.53 rows=92 width=1278) (actual time=217.264..217.264 rows=0 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 -&gt; Parallel Seq Scan on t_knowl_patent_zlxx (cost=0.00..82794.33 rows=38 width=1278) (actual time=212.355..212.355 rows=0 loops=3)
  Filter: ((c_applicant)::text ~~ '%Anping An Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County%'::text)
  Rows Removed by Filter: 333333
Planning time: 0.272 ms
Execution time: 228.116 ms"

2. btree index

Indexing statement

CREATE INDEX idx_public_t_knowl_patent_zlxx_applicant ON public.t_knowl_patent_zlxx(c_applicant varchar_pattern_ops);

Execution plan

"Gather (cost=1000.00..83803.53 rows=92 width=1278) (actual time=208.253..208.253 rows=0 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 -&gt; Parallel Seq Scan on t_knowl_patent_zlxx (cost=0.00..82794.33 rows=38 width=1278) (actual time=203.573..203.573 rows=0 loops=3)
  Filter: ((c_applicant)::text ~~ '%Anping An Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County%'::text)
  Rows Removed by Filter: 333333
Planning time: 0.116 ms
Execution time: 218.189 ms"

But if you change the query SQL slightly, remove the pre-% in the like query.

Index Scan using idx_public_t_knowl_patent_zlxx_applicant on t_knowl_patent_zlxx_temp (cost=0.55..8.57 rows=92 width=1278) (actual time=0.292..0.292 rows=0 loops=1)
 Index Cond: (((c_applicant)::text ~&gt;=~ 'An Ping'an Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County'::text) AND ((c_applicant)::text ~&lt;~ 'An Ping'an Breeding Cities in Lianshanguan Town, Benxi Manchu Autonomous County'::text))
 Filter: ((c_applicant)::text ~~ 'Anping An Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County'::text)
Planning time: 0.710 ms
Execution time: 0.378 ms

3. Gin index

Create index statements (postgresql requires version 9.6 and above)

create extension pg_trgm;
CREATE INDEX idx_public_t_knowl_patent_zlxx_applicant ON public.t_knowl_patent_zlxx USING gin (c_applicant gin_trgm_ops);

Execution plan

Bitmap Heap Scan on t_knowl_patent_zlxx (cost=244.71..600.42 rows=91 width=1268) (actual time=0.649..0.649 rows=0 loops=1)
 Recheck Cond: ((c_applicant)::text ~~ '%Anping An Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County%'::text)
 -&gt; Bitmap Index Scan on idx_public_t_knowl_patent_zlxx_applicant (cost=0.00..244.69 rows=91 width=0) (actual time=0.647..0.647 rows=0 loops=1)
  Index Cond: ((c_applicant)::text ~~ '%Anping An Breeding Farm in Lianshanguan Town, Benxi Manchu Autonomous County%'::text)
Planning time: 0.673 ms
Execution time: 0.740 ms

3. Conclusion

btree index allows fuzzy matching of the post-% "abc%" to go through the index, and gin + gp_trgm allows % "%abc%" to go through the index. However, gin indexes also have disadvantages, and the following situations may cause hits to fail:

When the search field is less than 3 characters, the index will not be hit, which is caused by gin's own mechanism.

When the search field is too long, such as email search, the index may not be hit, and the reason is currently unknown.

Supplement: PostgreSQL LIKE query efficiency improvement experiment

1. Query efficiency without indexing

For comparison, test the unindexed query first

EXPLAIN ANALYZE select * from gallery_map where author = 'Cao Zhiyun';
             QUERY PLAN             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on gallery_map (cost=0.00..7002.32 rows=1025 width=621) (actual time=0.011..39.753 rows=1031 loops=1)
 Filter: ((author)::text = 'Cao Zhiyun'::text)
 Rows Removed by Filter: 71315
 Planning time: 0.194 ms
 Execution time: 39.879 ms
(5 rows)
 
Time: 40.599 ms
EXPLAIN ANALYZE select * from gallery_map where author like 'Cao Zhiyun';
             QUERY PLAN             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on gallery_map (cost=0.00..7002.32 rows=1025 width=621) (actual time=0.017..41.513 rows=1031 loops=1)
 Filter: ((author)::text ~~ 'Cao Zhiyun'::text)
 Rows Removed by Filter: 71315
 Planning time: 0.188 ms
 Execution time: 41.669 ms
(5 rows)
 
Time: 42.457 ms
 
EXPLAIN ANALYZE select * from gallery_map where author like 'Cao Zhiyun%';
             QUERY PLAN             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on gallery_map (cost=0.00..7002.32 rows=1028 width=621) (actual time=0.017..41.492 rows=1031 loops=1)
 Filter: ((author)::text ~~ 'Cao Zhiyun%'::text)
 Rows Removed by Filter: 71315
 Planning time: 0.307 ms
 Execution time: 41.633 ms
(5 rows)
 
Time: 42.676 ms

Obviously, they will do full table scans

2. Create a btree index

PostgreSQL default index is btree

CREATE INDEX ix_gallery_map_author ON gallery_map (author);
 
EXPLAIN ANALYZE select * from gallery_map where author = 'Cao Zhiyun';  
                QUERY PLAN                
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on gallery_map (cost=36.36..2715.37 rows=1025 width=621) (actual time=0.457..1.312 rows=1031 loops=1)
 Recheck Cond: ((author)::text = 'Cao Zhiyun'::text)
 Heap Blocks: exact=438
 -&gt; Bitmap Index Scan on ix_gallery_map_author (cost=0.00..36.10 rows=1025 width=0) (actual time=0.358..0.358 rows=1031 loops=1)
   Index Cond: ((author)::text = 'Cao Zhiyun'::text)
 Planning time: 0.416 ms
 Execution time: 1.422 ms
(7 rows)
 
Time: 2.462 ms
 
EXPLAIN ANALYZE select * from gallery_map where author like 'Cao Zhiyun';
                QUERY PLAN                
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on gallery_map (cost=36.36..2715.37 rows=1025 width=621) (actual time=0.752..2.119 rows=1031 loops=1)
 Filter: ((author)::text ~~ 'Cao Zhiyun'::text)
 Heap Blocks: exact=438
 -&gt; Bitmap Index Scan on ix_gallery_map_author (cost=0.00..36.10 rows=1025 width=0) (actual time=0.560..0.560 rows=1031 loops=1)
   Index Cond: ((author)::text = 'Cao Zhiyun'::text)
 Planning time: 0.270 ms
 Execution time: 2.295 ms
(7 rows)
 
Time: 3.444 ms
EXPLAIN ANALYZE select * from gallery_map where author like 'Cao Zhiyun%';
             QUERY PLAN             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on gallery_map (cost=0.00..7002.32 rows=1028 width=621) (actual time=0.015..41.389 rows=1031 loops=1)
 Filter: ((author)::text ~~ 'Cao Zhiyun%'::text)
 Rows Removed by Filter: 71315
 Planning time: 0.260 ms
 Execution time: 41.518 ms
(5 rows)
 
Time: 42.430 ms
EXPLAIN ANALYZE select * from gallery_map where author like '% Research Room';
             QUERY PLAN             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on gallery_map (cost=0.00..7002.32 rows=2282 width=621) (actual time=0.064..52.824 rows=2152 loops=1)
 Filter: ((author)::text ~~ '% Research Room'::text)
 Rows Removed by Filter: 70194
 Planning time: 0.254 ms
 Execution time: 53.064 ms
(5 rows)
 
Time: 53.954 ms

You can see that whether the full match of equals and like is used for indexing, like fuzzy query or full table scanning

3. Create a gin index

CREATE EXTENSION pg_trgm;
 
CREATE INDEX ix_gallery_map_author ON gallery_map USING gin (author gin_trgm_ops);
EXPLAIN ANALYZE select * from gallery_map where author like 'Cao%'; 
                QUERY PLAN                
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on gallery_map (cost=19.96..2705.69 rows=1028 width=621) (actual time=0.419..1.771 rows=1031 loops=1)
 Recheck Cond: ((author)::text ~~ 'Cao%'::text)
 Heap Blocks: exact=438
 -&gt; Bitmap Index Scan on ix_gallery_map_author (cost=0.00..19.71 rows=1028 width=0) (actual time=0.312..0.312 rows=1031 loops=1)
   Index Cond: ((author)::text ~~ 'Cao%'::text)
 Planning time: 0.358 ms
 Execution time: 1.916 ms
(7 rows)
 
Time: 2.843 ms
EXPLAIN ANALYZE select * from gallery_map where author like '%Win%'; 
             QUERY PLAN             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on gallery_map (cost=0.00..7002.32 rows=1028 width=621) (actual time=0.015..51.641 rows=1031 loops=1)
 Filter: ((author)::text ~~ '%Win%'::text)
 Rows Removed by Filter: 71315
 Planning time: 0.268 ms
 Execution time: 51.957 ms
(5 rows)
 
Time: 52.899 ms
EXPLAIN ANALYZE select * from gallery_map where author like '%Learning Room%';
                QUERY PLAN                
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on gallery_map (cost=31.83..4788.42 rows=2559 width=621) (actual time=0.914..4.195 rows=2402 loops=1)
 Recheck Cond: ((author)::text ~~ '%Learning Room%'::text)
 Heap Blocks: exact=868
 -&gt; Bitmap Index Scan on ix_gallery_map_author (cost=0.00..31.19 rows=2559 width=0) (actual time=0.694..0.694 rows=2402 loops=1)
   Index Cond: ((author)::text ~~ '%Learning Room%'::text)
 Planning time: 0.306 ms
 Execution time: 4.403 ms
(7 rows)
 
Time: 5.227 ms

The gin_trgm index has much better effect

Since the index of pg_trgm cuts the string into multiple 3 tuples and then uses these 3 tuples to match, the gin_trgm index only has the prefix matching to the index for queries with less than 3 characters (including Chinese characters).

In addition, btree_gin was also tested, the effect is the same as btree

Notice:

gin_trgm requires that the database must be encoded using UTF-8

demo_v1 # \l demo_v1
        List of databases
 Name | Owner | Encoding | Collate | Ctype | Access privileges
---------+-----------+----------+-------------+-------------+-------------------
 demo_v1 | wmpp_user | UTF8  | en_US.UTF-8 | en_US.UTF-8 |

The above is personal experience. I hope you can give you a reference and I hope you can support me more. If there are any mistakes or no complete considerations, I would like to give you advice.