ArteryBase-fuzzy matching killer
Problem background
As pg becomes more and more powerful, abase has been upgraded to 5.0 (postgresql10.4). Currently, abase5.0 inherits the full-text retrieval plug-in (zhparser), making it more and more convenient to use full-text retrieval. This article will explain the like fuzzy matching supported by abase, search the full text, what kind of index to create, and how to use it. For all kinds of fuzzy matches, it can be indexed
Front fuzzy match (%xxx), back fuzzy match (xxx%)
Use scenario: If you have simple front fuzzy matches or back fuzzy matches, you can create a simple btree index.
--1.Post-fuzzy matching(xxx%) create index i_t_msys_btrre_c_ajmc on db_msys.t_msys using btree(c_ajmc text_pattern_ops); CREATE INDEX Time: 4189.886 ms (00:04.190) db_15fb=# select c_ajmc from db_msys.t_msys where c_ajmc like 'Beijing%'; c_ajmc ------------------------ Beijing decides to go with Huayu Beijing and Huayu Information Beijing Beijing华宇,Beijing华宇 、、、 db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where c_ajmc like 'Beijing%'; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=43.92..2177.91 rows=4204 width=80) (actual time=0.570..2.667 rows=1570 loops=1) Filter: ((c_ajmc)::text ~~ 'Beijing%'::text) Heap Blocks: exact=500 -> Bitmap Index Scan on i_t_msys_btrre_c_ajmc (cost=0.00..42.87 rows=632 width=0) (actual time=0.477..0.477 rows=1570 loops=1) Index Cond: (((c_ajmc)::text ~>=~ 'Beijing'::text) AND ((c_ajmc)::text ~<~ 'Beiting'::text)) Planning time: 0.956 ms Execution time: 2.841 ms (7 rows) Time: 4.848 ms --2.Front fuzzy match(%xxx),Query withc_ajmcRecords ending with information,Use the inversion functionreverse db_15fb=# create index i_t_msys_reverse_c_ajmc on db_msys.t_msys using btree(reverse(c_ajmc) text_pattern_ops); CREATE INDEX Time: 4011.131 ms (00:04.011) --Query with张三结尾的信息 db_15fb=# select c_ajmc from db_msys.t_msys where reverse(c_ajmc) like reverse('%Zhang San'); c_ajmc ---------- Huayu Zhang San Beijing张三 (2 rows) Time: 0.910 ms --Front fuzzy match也可走索引 db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where reverse(c_ajmc) like reverse('%Zhang San'); QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=114.86..5312.88 rows=1787 width=80) (actual time=0.064..0.066 rows=2 loops=1) Filter: (reverse((c_ajmc)::text) ~~ 'Three photos %'::text) Heap Blocks: exact=1 -> Bitmap Index Scan on i_t_msys_reverse_c_ajmc (cost=0.00..114.42 rows=1787 width=0) (actual time=0.042..0.042 rows=2 loops=1) Index Cond: ((reverse((c_ajmc)::text) ~>=~ 'Three photos'::text) AND (reverse((c_ajmc)::text) ~<~ 'Three Balls'::text)) Planning time: 0.236 ms Execution time: 0.148 ms (7 rows) Time: 1.211 ms --Or uselike 'Three photos %'Equivalent to reverse('%Zhang San') db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where reverse(c_ajmc) like 'three photos %'; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=114.86..5312.88 rows=1787 width=80) (actual time=0.056..0.058 rows=2 loops=1) Filter: (reverse((c_ajmc)::text) ~~ 'Three photos %'::text) Heap Blocks: exact=1 -> Bitmap Index Scan on i_t_msys_reverse_c_ajmc (cost=0.00..114.42 rows=1787 width=0) (actual time=0.036..0.036 rows=2 loops=1) Index Cond: ((reverse((c_ajmc)::text) ~>=~ 'Three photos'::text) AND (reverse((c_ajmc)::text) ~<~ 'Three Balls'::text)) Planning time: 0.259 ms Execution time: 0.108 ms (7 rows) Time: 1.119 ms
The principle of front fuzzy matching is to invert the data and store the data. The fields need to be inverted during query, and the input values also need to be inverted. The principle is the same as the previous fuzzy matching.
Full fuzzy match (%xxx%)-triple match pg_trgm
Usage scenario: pg_trgm supports front fuzzy matching, back fuzzy matching and full fuzzy matching, but full fuzzy matching requires at least three characters to be indexed, and only takes effect in scenarios where full fuzzy matching is no less than three characters (abase one Chinese character is one character), that is, like '%xxx%' cannot be less than three Chinese characters.
pg_trgmExtensionsabaseIt also comes with it,If you can't use it, you can try to delete the extension first,Then create the extension --View Installation Extensions db_sqlfx=# select * from pg_extension; extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition --------------+----------+--------------+----------------+------------+-- plpgsql | 10 | 11 | f | 1.0 | | uuid-ossp | 10 | 2200 | t | 1.1 | | adminpack | 10 | 11 | f | 1.1 | | postgres_fdw | 10 | 2200 | t | 1.0 | | file_fdw | 10 | 2200 | t | 1.0 | | pg_prewarm | 10 | 2200 | t | 1.1 | | btree_gin | 10 | 2200 | t | 1.2 | | zhparser | 10 | 2200 | t | 1.0 | | pg_trgm | 10 | 2200 | t | 1.3 | | (9 rows) --If not, you can create an extension: create extension pg_trgm; --Remove the extension drop extension pg_trgm; --c_ajmccreateginindex db_15fb=# create index i_t_msys_gin_c_ajmc on db_msys.t_msys using gin(c_ajmc gin_trgm_ops); CREATE INDEX Time: 25013.192 ms (00:25.013) --Query'Dongting Lake' db_15fb=# select c_ajmc from db_msys.t_msys where c_ajmc like ('%Dongting Lake%'); c_ajmc ---------------- Test Dongting Lake data (1 row) Time: 1.005 ms --全模糊匹配可走index db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where c_ajmc like ('%Dongting Lake%'); QUERY PLAN ------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=24.27..159.92 rows=35 width=80) (actual time=0.088..0.088 rows=1 loops=1) Recheck Cond: ((c_ajmc)::text ~~ '%Dongting Lake%'::text) Heap Blocks: exact=1 -> Bitmap Index Scan on i_t_msys_gin_c_ajmc (cost=0.00..24.27 rows=35 width=0) (actual time=0.069..0.069rows=1 loops=1) Index Cond: ((c_ajmc)::text ~~ '%Dongting Lake%'::text) Planning time: 0.404 ms Execution time: 0.152 ms (7 rows) Time: 1.263 ms --Post-fuzzy matching,You need to delete the previous one firstbtree,By default, it will leavebtreeBecause of the cost ratioginLow,(It should be noted thatpg_trgm的Post-fuzzy matching至少需要提供一个字符才会走,Two characters are required before fuzzy matching) drop index i_t_msys_btrre_c_ajmc; db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where c_ajmc like ('Beijing%'); QUERY PLAN ------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=72.58..9791.59 rows=4204 width=80) (actual time=1.058..4.993 rows=1570 loo ps=1) Recheck Cond: ((c_ajmc)::text ~~ 'Beijing%'::text) Rows Removed by Index Recheck: 855 Heap Blocks: exact=989 -> Bitmap Index Scan on i_t_msys_gin_c_ajmc (cost=0.00..71.53 rows=4204 width=0) (actual time=0.869..0.8 69 rows=2425 loops=1) Index Cond: ((c_ajmc)::text ~~ 'Beijing%'::text) Planning time: 0.589 ms Execution time: 5.160 ms (8 rows) Time: 6.658 ms --useginindex Front fuzzy match db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where c_ajmc like ('% contract dispute'); QUERY PLAN ------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=1220.09..19633.34 rows=126980 width=80) (actual time=62.980..298.705 rows=166872 loops=1) Recheck Cond: ((c_ajmc)::text ~~ '% Contract Dispute'::text) Rows Removed by Index Recheck: 12 Heap Blocks: exact=16654 -> Bitmap Index Scan on i_t_msys_gin_c_ajmc (cost=0.00..1188.35 rows=126980 width=0) (actual time=58.905..58.905 rows=166886 loops=1) Index Cond: ((c_ajmc)::text ~~ '% Contract Dispute'::text) Planning time: 0.623 ms Execution time: 309.385 ms (8 rows) Time: 311.072 ms --usebtreeThe inversion function db_15fb=# explain analyze select c_ajmc from db_msys.t_msys where reverse(c_ajmc) like reverse('% contract dispute'); QUERY PLAN------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=114.86..5312.88 rows=1787 width=80) (actual time=51.135..289.537 rows=166872 loops=1) Filter: (reverse((c_ajmc)::text) ~~ 'Dispute Contract%'::text) Heap Blocks: exact=16654 -> Bitmap Index Scan on i_t_msys_reverse_c_ajmc (cost=0.00..114.42 rows=1787 width=0) (actual time=46.970..46.970 rows=166874 loops=1) Index Cond: ((reverse((c_ajmc)::text) ~>=~ 'Dispute and Contract'::text) AND (reverse((c_ajmc)::text) ~<~ 'Auspicious dispute'::text)) Planning time: 0.268 ms Execution time: 301.174 ms (7 rows) Time: 302.413 ms 可以看出Front fuzzy matchuseginandbtree都可以走index,ginandbtreeThe efficiency is not much different,butginindexbig小比btreebig,且create耗费时间
The front and back fuzzy matching of pg_trgm extension can also be indexed, and the back fuzzy matching btree is more efficient than gin.
Full text search-zhparser
Use scenarios: single field full text search, multi-field full text search, row-level full text search
Currently, abase5.0 comes with full-text retrieval support. You can see the extension of zhparser using select * from pg_extension. Before abase5.0, you need to install it manually
--View Installation Extensions db_sqlfx=# select * from pg_extension; extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition --------------+----------+--------------+----------------+------------+-----------+-------------- plpgsql | 10 | 11 | f | 1.0 | | uuid-ossp | 10 | 2200 | t | 1.1 | | adminpack | 10 | 11 | f | 1.1 | | postgres_fdw | 10 | 2200 | t | 1.0 | | file_fdw | 10 | 2200 | t | 1.0 | | pg_prewarm | 10 | 2200 | t | 1.1 | | btree_gin | 10 | 2200 | t | 1.2 | | zhparser | 10 | 2200 | t | 1.0 | | pg_trgm | 10 | 2200 | t | 1.3 | | (9 rows) --If not, you can create an extension: db_15fb=# create extension zhparser; CREATE EXTENSION --Create and usezhparserConfiguration as a parser full text search db_15fb=# create text search configuration testzhcfg(parser = zhparser); CREATE TEXT SEARCH CONFIGURATION --Add to full text search configurationtokenMapping db_15fb=# alter text search configuration testzhcfg add mapping for n,v,a,i,e,l with simple; ALTER TEXT SEARCH CONFIGURATION The abovetokenMapping只Mapping了名词(n),verb(v),adjective(a),idiom(i),interjection(e)and customary terms(l)6kind,this6kind以外的tokenAll blocked。The dictionary uses built-insimpledictionary,That is, only lowercase conversion。根据需要可以灵活定义dictionary和tokenMapping,To realize functions such as blocking words and synonyms。 --Word participle effect db_15fb=# select to_tsvector('testzhcfg','Nanjing Yangtze River Bridge'); to_tsvector ---------------------------------------------------------------------------------------- 'Nanjing':2 'Nanjing':1 'big':9 'Bridge':6 'city':3 'bridge':10 'River':8 'long':7 'Yangtze':5 'Yangtze River Bridge':4 (1 row)
Full text search query
--c_ajmcCreate an index,You can see that the creationginIndex comparisonbtreeIt's time-consuming db_15fb=# create index i_t_msys_c_ajmc on db_msys.t_msys using gin(to_tsvector('testzhcfg',c_ajmc)); CREATE INDEX Time: 32601.072 ms (00:32.601) --Queryc_ajmcIncluding Beijing Huayu,to_tsquery('testzhcfg','Beijing Huayu') db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@to_tsquery('testzhcfg','Beijing Huayu'); c_baah | c_ajmc ---------------+---------------------- Huayu | Beijing决定andHuayu Test case number | 测试Beijing与Huayuxxdispute Song of the Fifth Ring Road in Beijing | BeijingandHuayu信息 (2018)xxxxxx1 | Beijing出席Huayu科技 Test case numberHuayu | BeijingHuayu (5 rows) Time: 1.927 ms db_15fb=# explain analyze select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@to_tsquery('testzhcfg','Beijing Huayu'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=56.00..60.26 rows=1 width=106) (actual time=0.989..1.004 rows=3 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' & ''Huayu'' & ''magnificent'' & ''Yu'''::tsquery) Heap Blocks: exact=5 -> Bitmap Index Scan on i_t_msys_c_ajmc (cost=0.00..56.00 rows=1 width=0) (actual time=0.971..0.971 rows=13 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' & ''Huayu'' & ''magnificent'' & ''Yu'''::tsquery) Planning time: 0.275 ms Execution time: 1.055 ms (7 rows) Time: 2.290 ms --to_tsquery('testzhcfg','Beijing Huayu')Equivalent toto_tsquery('testzhcfg','Beijing & Huayu') db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@to_tsquery('testzhcfg','Beijing&Huayu'); c_baah | c_ajmc ---------------+---------------------- Huayu | Beijing决定andHuayu Test case number | 测试Beijing与Huayuxxdispute Song of the Fifth Ring Road in Beijing | BeijingandHuayu信息 (2018)xxxxxx1 | Beijing出席Huayu科技 Test case numberHuayu | BeijingHuayu (5 rows) Time: 2.037 ms db_15fb=# explain analyze select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@to_tsquery('testzhcfg','Beijing&Huayu'); QUERY PLAN --------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=56.00..60.26 rows=1 width=106) (actual time=0.941..0.958 rows=5 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' & ''Huayu'' & ''magnificent'' & ''Yu'''::tsquery) Heap Blocks: exact=5 -> Bitmap Index Scan on i_t_msys_c_ajmc (cost=0.00..56.00 rows=1 width=0) (actual time=0.921..0.921 rows=15 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' & ''Huayu'' & ''magnificent'' & ''Yu'''::tsquery) Planning time: 0.295 ms Execution time: 1.008 ms (7 rows) Time: 2.070 ms --Include'Beijing'or'Huayu'of:to_tsquery('testzhcfg','Beijing|Huayu') db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@to_tsquery('testzhcfg','Beijing|Huayu'); c_baah | c_ajmc ---------------------------+-------------- Beijing Science and Technology Park | Huayu信息 Huayu | Beijing决定andHuayu Test case number | 测试Beijing与Huayuxxdispute Song of the Fifth Ring Road in Beijing | BeijingandHuayu信息 (2017)xxThe beginning of the *xxNumber | Huayu (2017)xxThe beginning of the *xxNumber | Beijing ... Time: 10.426 ms db_15fb=# explain analyze select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@to_tsquery('testzhcfg','Beijing|Huayu'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=69.85..5710.15 rows=1787 width=106) (actual time=2.269..7.338 rows=2941 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' | ''Huayu'' & ''magnificent'' & ''Yu'''::tsquery) Heap Blocks: exact=1355 -> Bitmap Index Scan on i_t_msys_c_ajmc (cost=0.00..69.41 rows=1787 width=0) (actual time=2.034..2.034 rows=2954 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' | ''Huayu'' & ''magnificent'' & ''Yu'''::tsquery) Planning time: 0.268 ms Execution time: 7.565 ms (7 rows) Time: 8.655 ms 这里Queryof结果Include了BeijingandHuayu,如果想让只QueryInclude'Beijing'and'Huayu'中间不Include其他名词或verb等,Availablephraseto_tsquery,No matter what it is'Beijing Huayu','Beijing|Huayu','Beijing & Huayu'The results are all the same。 db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@phraseto_tsquery('testzhcfg','Beijing Huayu'); c_baah | c_ajmc --------------+---------------------- Test case number | 测试Beijing与Huayuxxdispute Song of the Fifth Ring Road in Beijing | BeijingandHuayu信息 Test case numberHuayu | BeijingHuayu (3 rows) Time: 2.203 ms db_15fb=# explain analyze select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@phraseto_tsquery('testzhcfg','Beijing Huayu'); QUERY PLAN ------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on t_msys (cost=56.00..60.26 rows=1 width=106) (actual time=1.147..1.258 rows=3 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' <-> ''Huayu'' <-> ''magnificent'' <-> ''Yu'''::tsquery) Rows Removed by Index Recheck: 2 Heap Blocks: exact=5 -> Bitmap Index Scan on i_t_msys_c_ajmc (cost=0.00..56.00 rows=1 width=0) (actual time=1.016..1.016 rows=15 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text) @@ '''Beijing'' <-> ''Huayu'' <-> ''magnificent'' <-> ''Yu'''::tsquery) Planning time: 0.333 ms Execution time: 1.307 ms (8 rows) 但是结果中Include了'Test the dispute between Beijing and Huayu xx','Beijing and Huayu Information',The reason istokenNouns are ignored in mapping(n),verb(v),adjective(a),idiom(i),interjection(e)and习用语(l)6种以外of词。If you need it, you can add it to it, so you can match it more accurately'Beijing Huayu' --不Include'and' ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple; db_sqlfx=# select to_tsvector('testzhcfg','Beijing and Huayu'); to_tsvector --------------------------------- 'Beijing':1 'magnificent':3 'Huayu':2 'Yu':4 (1 row) --将所有词性全部影射出后就就Include'and' ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z WITH simple; db_sqlfx=# select to_tsvector('testzhcfg','Beijing and Huayu'); to_tsvector ------------------------------------------------------ 'and':4 'Beijing':3 'north':2 'Beijing':1 'magnificent':6 'Huayu':5 'Yu':7 (1 row) --'Beijing'and'Huayu'中间不Include任何词,结果Include'Beijing Huayu' ALTER TEXT SEARCH CONFIGURATION testzhcfg alter MAPPING FOR a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z WITH simple; db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc)@@phraseto_tsquery('testzhcfg','Beijing Huayu'); c_baah | c_ajmc ------------------------+---------------------- (2017)Sichuan0191The beginning of the *3198Number | 测试BeijingHuayu信息技术 (2017)Sichuan0191The beginning of the *9022Number | 测试BeijingHuayuxxx (2 rows) Time: 1.347 ms
Multi-field full text search
The previous one is the full text search of a single field. How to do it if I need to do full text search of multiple fields?
For example, I need to query the row containing 'Beijing' in fields such as the case number, case name, etc. Since pg_trgm is a triple match, this place cannot be implemented.
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple; --Function index for multiple fields Create a function index,Need to modify the function to a stable state(immutable) db_15fb=# create or replace function f1(regconfig,text) returns tsvector as $$ select to_tsvector($1,$2); $$ language sql immutable strict; CREATE FUNCTION Time: 13.580 ms db_15fb=# create or replace function f1(text) returns tsvector as $$ select to_tsvector($1); $$ language sql immutable strict; CREATE FUNCTION Time: 17.822 ms db_15fb=# alter function record_out(record) immutable; ALTER FUNCTION Time: 20.823 ms db_15fb=# alter function textin(cstring) immutable; ALTER FUNCTION Time: 15.078 ms --createc_baah,c_ajmcField index db_15fb=# create index i_t_msys_ah_ajmc on db_msys.t_msys using gin(f1('testzhcfg'::regconfig,c_baah||c_ajmc::text)); CREATE INDEX Time: 38587.146 ms (00:38.587) --Queryc_baah,c_ajmcInclude'Beijing'and'Huayu'Records of,Equivalent toto_tsquery('testzhcfg','Beijing Huayu') ; to_tsquery('Beijing & Huayu')A little more efficient db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where f1('testzhcfg'::regconfig,c_baah||c_ajmc::text) @@ to_tsquery('Beijing&Huayu') ; c_baah | c_ajmc ---------------+---------------------- Beijing Science and Technology Park | Huayu Information Huayu | Beijing决定andHuayu Test case number | 测试Beijing与Huayuxxdispute Song of the Fifth Ring Road in Beijing | BeijingandHuayu Information (2018)xxxxxx1 | Beijing出席Huayu科技 Test case numberHuayu | BeijingHuayu (6 rows) Time: 1.222 ms db_15fb=# explain analyze select c_baah,c_ajmc from db_msys.t_msys where f1('testzhcfg'::regconfig,c_baah||c_ajmc::text) @@ to_tsquery('Beijing&Huayu') ; QUERY PLAN --------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=32.32..72.33 rows=9 width=106) (actual time=0.184..0.197 rows=6 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, ((c_baah)::text || (c_ajmc)::text)) @@ to_tsquery('Beijing & Huayu'::text)) Heap Blocks: exact=3 -> Bitmap Index Scan on i_t_msys_ah_ajmc (cost=0.00..32.32 rows=9 width=0) (actual time=0.163..0.163 rows=7 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, ((c_baah)::text || (c_ajmc)::text)) @@ to_tsquery('Beijing & Huayu'::text)) Planning time: 0.329 ms Execution time: 0.251 ms (7 rows) Time: 1.396 ms ----Queryc_baah,c_ajmcInclude'Beijing'or'Huayu'Records of db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where f1('testzhcfg'::regconfig,c_baah||c_ajmc::text) @@ to_tsquery('Beijing|Huayu') ; c_baah | c_ajmc ---------------------------+-------------------- Beijing Science and Technology Park | Huayu Information Huayu | Beijing决定andHuayu Test case number | 测试Beijing与Huayuxxdispute Song of the Fifth Ring Road in Beijing | BeijingandHuayu Information Beijing Olympics | Song of (2017)xxxxxThe beginning of the *xxxxNumber | Huayu (2017)xxxx1The beginning of the *xxxxNumber | Beijing ...Not shown completely Time: 9.965 ms db_15fb=# explain analyze select c_baah,c_ajmc from db_msys.t_msys where f1('testzhcfg'::regconfig,c_baah||c_ajmc::text) @@ to_tsquery('Beijing|Huayu') ; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=63.89..10564.67 rows=3566 width=106) (actual time=1.104..6.190 rows=2942 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, ((c_baah)::text || (c_ajmc)::text)) @@ to_tsquery('Beijing|Huayu'::text)) Heap Blocks: exact=1353 -> Bitmap Index Scan on i_t_msys_ah_ajmc (cost=0.00..62.99 rows=3566 width=0) (actual time=0.853..0.853 rows=2944 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, ((c_baah)::text || (c_ajmc)::text)) @@ to_tsquery('Beijing|Huayu'::text)) Planning time: 0.285 ms Execution time: 6.429 ms (7 rows) Time: 7.670 ms --Query'Beijing Huayu' db_15fb=# select c_baah,c_ajmc from db_msys.t_msys where f1('testzhcfg'::regconfig,c_baah||c_ajmc::text) @@ phraseto_tsquery('testzhcfg','Beijing Huayu') ; c_baah | c_ajmc ------------------------+---------------------- (2017)Sichuan0191The beginning of the *3198Number | 测试BeijingHuayu Information技术 (2017)Sichuan0191The beginning of the *9022Number | 测试BeijingHuayuxxx (2 rows) Time: 1.786 ms
Line-level full text search[
For example, you need to find a match in all columns'Beijing'Value of uset_msys::textYou can convert lines into a large text。 --Create row-level full text search db_15fb=# create index i_t_msys_all on db_msys.t_msys using gin(f1('testzhcfg'::regconfig,t_msys::text)); CREATE INDEX Time: 128538.026 ms (02:08.538) --Query all fields containing'Beijing'The situation db_15fb=# select c_jksxcsmc,c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg'::regconfig,t_msys::text)@@to_tsquery('Beijing'); c_jksxcsmc | c_baah | c_ajmc ----------------------------+---------------------------+------------------------------------------------------------------------- NanjingxxInformation Technology Co., Ltd. | Beijing Science and Technology Park | Huayu Information NanjingxxInformation Technology Co., Ltd. | Huayu | 北京决定和Huayu NanjingxxInformation Technology Co., Ltd. | Test case number | 测试北京与Huayuxxdispute NanjingxxInformation Technology Co., Ltd. | Song of the Fifth Ring Road in Beijing | 北京和Huayu Information NanjingxxInformation Technology Co., Ltd. | Beijing Olympics | Song of 北京HuayuInformation Technology Co., Ltd. | Test data | Test data Test Beijing Information Technology | Test data | Test data ... Time: 10.382 ms db_15fb=# explain analyze select c_jksxcsmc,c_baah,c_ajmc from db_msys.t_msys where to_tsvector('testzhcfg'::regconfig,t_msys::text)@@to_tsquery('Beijing'); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on t_msys (cost=38.10..6134.09 rows=1787 width=146) (actual time=1.014..6.792 rows=2841 loops=1) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (t_msys.*)::text) @@ to_tsquery('Beijing'::text)) Heap Blocks: exact=1281 -> Bitmap Index Scan on i_t_msys_all (cost=0.00..37.66 rows=1787 width=0) (actual time=0.788..0.788 rows=2843 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (t_msys.*)::text) @@ to_tsquery('Beijing'::text)) Planning time: 0.312 ms Execution time: 7.056 ms (7 rows) Time: 8.364 ms
Weight sorting
How often does query terms appear in documents,How close is the term in the document,and the importance of the parts that appear in the document --c_ajmcSort by weight db_15fb=# select c_ajmc,ts_rank(to_tsvector('testzhcfg',c_ajmc),to_tsquery('testzhcfg','Beijing Huayu')) rankfrom db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc) @@to_tsquery('testzhcfg','Beijing Huayu') order by rank desc ; c_ajmc | rank ----------------------+---------- Beijing Huayu,Beijing Huayu | 0.910206 Test Beijing and Huayuxxdispute | 0.463622 Beijing and Huayu Information | 0.463622 Beijing Huayu | 0.463622 Beijing decides to go with Huayu | 0.457134 Beijing attended Huayu Technology | 0.457134 (6 rows) Time: 2.179 ms --c_baah,c_ajmcMulti-field weight sorting db_15fb=# select c_baah,c_ajmc,ts_rank(f1('testzhcfg'::regconfig,c_baah||c_ajmc::text),to_tsquery('Beijing & Huayu')) rankdb_15fb-# from db_msys.t_msys where f1('testzhcfg'::regconfig,c_baah||c_ajmc::text) @@to_tsquery('Beijing&Huayu')db_15fb-# order by rank desc; c_baah | c_ajmc | rank -------------------+----------------------+----------- Beijing Huayu,Beijing Huayu | Beijing Huayu,Beijing Huayu | 0.733734 Test case number Huayu | Beijing Huayu | 0.186813 Huayu | Beijing decides to go with Huayu | 0.185238 Song of the Fifth Ring Road in Beijing | Beijing and Huayu Information | 0.181526 Test case number | Test Beijing and Huayuxxdispute | 0.0991032 (2018)xxxxxx1 | Beijing attended Huayu Technology | 0.0973585 Beijing Science and Technology Park | Huayu信息 | 0.095243 (7 rows) Time: 2.038 ms --Query divorce information,Return result26610strip,time consuming849ms db_15fb=# explain (analyze,verbose,buffers) select c_ajmc,ts_rank(to_tsvector('testzhcfg',c_ajmc),to_tsquery('testzhcfg','divorce')) rankdb_15fb-# from db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc) @@to_tsquery('testzhcfg','divorce') order by rank desc limit 10db_15fb-# ; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=48.53..48.53 rows=1 width=84) (actual time=849.020..849.023 rows=10 loops=1) Output: c_ajmc, (ts_rank(to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text), '''divorce'' & ''Leave'' & ''marriage'''::tsquery)) Buffers: shared hit=11372 -> Sort (cost=48.53..48.53 rows=1 width=84) (actual time=849.017..849.018 rows=10 loops=1) Output: c_ajmc, (ts_rank(to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text), '''divorce'' & ''Leave'' & ''marriage'''::tsquery)) Sort Key: (ts_rank(to_tsvector('testzhcfg'::regconfig, (t_msys.c_ajmc)::text), '''divorce'' & ''Leave'' & ''marriage'''::tsquery)) DESC Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=11372 -> Bitmap Heap Scan on db_msys.t_msys (cost=44.00..48.52 rows=1 width=84) (actual time=14.057..825.193 rows=26610 loops=1) Output: c_ajmc, ts_rank(to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text), '''divorce'' & ''Leave'' & ''marriage'''::tsquery) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (t_msys.c_ajmc)::text) @@ '''divorce'' & ''Leave'' & ''marriage'''::tsquery) Heap Blocks: exact=11336 Buffers: shared hit=11372 -> Bitmap Index Scan on i_t_msys_c_ajmc (cost=0.00..44.00 rows=1 width=0) (actual time=11.260..11.260 rows=26610 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (t_msys.c_ajmc)::text) @@ '''divorce'' & ''Leave'' & ''marriage'''::tsquery) Buffers: shared hit=36 Planning time: 0.384 ms Execution time: 849.099 ms (18 rows) Time: 850.649 ms --Query contract|dispute,return179308strip数据,time consuming10s db_15fb=# explain (analyze,verbose,buffers) select c_ajmc,ts_rank(to_tsvector('testzhcfg',c_ajmc),to_tsquery('testzhcfg','Contract|Dispute')) rankfrom db_msys.t_msys where to_tsvector('testzhcfg',c_ajmc) @@to_tsquery('testzhcfg','Contract|Dispute') ; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on db_msys.t_msys (cost=80.00..84.52 rows=1 width=84) (actual time=148.596..10658.341 rows=179308 loops=1) Output: c_ajmc, ts_rank(to_tsvector('testzhcfg'::regconfig, (c_ajmc)::text), '''contract'' & ''combine'' & ''same'' | ''dispute'' & ''correct'' & ''Fan'''::tsquery) Recheck Cond: (to_tsvector('testzhcfg'::regconfig, (t_msys.c_ajmc)::text) @@ '''contract'' & ''combine'' & ''same'' | ''dispute'' & ''correct'' & ''Fan'''::tsquery) Heap Blocks: exact=16632 Buffers: shared hit=16811 -> Bitmap Index Scan on i_t_msys_c_ajmc (cost=0.00..80.00 rows=1 width=0) (actual time=144.298..144.298 rows=179310 loops=1) Index Cond: (to_tsvector('testzhcfg'::regconfig, (t_msys.c_ajmc)::text) @@ '''contract'' & ''combine'' & ''same'' | ''dispute'' & ''correct'' & ''Fan'''::tsquery) Buffers: shared hit=179 Planning time: 0.373 ms Execution time: 10695.288 ms (10 rows)
It can be seen that when the query result set is large, sorting is very time-consuming, because it requires querying each tsvector that matches the document. If a row is large, it may be stored in the toast table, which involves a large number of random accesses.
Disk io will rise. Unfortunately, this is almost impossible to avoid, as actual queries often lead to huge numbers of matches.
Table size: db_15fb=# select pg_size_pretty(pg_relation_size('t_msys')); pg_size_pretty ---------------- 131 MB (1 row) Time: 0.858 ms --Index usage | Index Type | Index name | Index size | Time-consuming to create | Scene | | :----------- | ----------------------- | -------- | -------------|--------------| | btree | i_t_msys_btrre_c_ajmc | 37MB | 4189.886 ms |Front fuzzy match | | btree-reverse| i_t_msys_reverse_c_ajmc | 37MB | 4011.131 ms |Post-fuzzy matching | | gin-pg_trgm |i_t_msys_gin_c_ajmc | 67MB | 25013.192 ms |Full fuzzy matching triplets| | gin-zhparser |i_t_msys_c_ajmc | 21MB | 32601.072 ms |Single field full text search | | gin-zhparser |i_t_msys_ah_ajmc | 25MB | 38587.146 ms |Multi-field full text search | | gin-zhparser |i_t_msys_all | 106MB | 128538.026 ms|Line-level full text search | Line-level full text search占用空间接近表达小,Creation is also time-consuming。
Conclusion
1. After fuzzy matching (xxx%), you can use btree to create an index, which is more efficient than gin index, using btree(c_ajmc text_pattern_ops).
2. Before fuzzy matching (%xxx), the efficiency of btree and gin is not much different, but gin creation takes time and gin is larger than btree index. Therefore, it is recommended to use the btree reverse function to create an index. using btree(reverse(c_ajmc))
3. Full fuzzy matching (%xxx%), you can use gin to create an index, but pg_trgm supports at least three characters. using gin(c_ajmc gin_trgm_ops)
4. If you need to search the full text of multiple fields, such as querying case numbers, or case name, the two fields contain the value of "Beijing" or the case name contain "Beijing" and "Huayu" in the case name, you can use the full text search. For specific creation, refer to the above examples.
5. It should be noted that when the result set returned by the full text search is large, the efficiency of sorting according to weights will be very low! ! !
6. It should be noted that using full fuzzy matching, too few characters in the query will return more results, which will affect the query efficiency! ! !
7. If this field only needs to be matched after fuzzy only needs to be indexed: using btree(c_ajmc text_pattern_ops). If the field only requires a previous fuzzy match, then index using btree(reverse(c_ajmc)). If the field has a full fuzzy match and a front and back fuzzy match, you only need to create a gin index.
7.pg_trgm
8.zhparser
References
References
ts_rank
This is the end of this article about postgresql fuzzy matching killer. For more related postgresql fuzzy matching content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!