
How to do Chinese full-text search with PostgreSQL

Getting started

Install

First, install PgSQL. I am using PgSQL 9.6 here; PgSQL 10 has just been released, so try it out if you are interested.

Installing PgSQL is fairly involved. Besides the server and client packages, you also need the devel package, and to get spatial indexing we also need the all-important PostGIS plug-in, which pulls in a lot of dependencies. Installing everything by hand is tedious and error-prone.

Automatic installation is recommended: use yum together with a repository such as EPEL so that all dependencies can be resolved in one place. Best of all, run it in Docker; just pick a ready-made image.

Plugins

Since many PgSQL features are implemented as plug-ins, some commonly used ones need to be installed, such as:

postgis_topology (manages topological objects such as faces, edges, points, etc.)

pgrouting (path planning)

postgis_sfcgal (3D-related algorithms)

fuzzystrmatch (string similarity calculation)

address_standardizer/address_standardizer_data_us (address standardization)

pg_trgm (trigram-based fuzzy matching)

These plug-ins are compiled into the installation's extension directory (/path/extensions); before a plug-in can be used in a database, it must be enabled with CREATE EXTENSION xxx.
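For example, to enable two of the plug-ins listed above in the current database (a minimal sketch; run once per database):

CREATE EXTENSION postgis;
CREATE EXTENSION pg_trgm;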

Startup

1. Switch to a non-root user (after installation, PgSQL creates a superuser named postgres; we can operate PgSQL as this superuser, but it is recommended to create an ordinary user later for managing data);

2. Switch to the /installPath/bin/ directory. PgSQL provides many commands in this directory, such as createdb, createuser, dropdb, pg_dump, etc.;

3. Use the initdb command to initialize a folder dir_db (the directory must not already exist) to store the database's physical data, and use the -E UTF8 parameter to set the database character set to UTF-8;

4. Use pg_ctl start -D dir_db to start the background service for that data directory;

5. Use psql -d db to log in to PgSQL on the command line;

Configuration

After installation, a few basic parameters need to be configured before the database can be used properly.

Host permissions

For the database to be accessed from other machines, host permissions must be configured in the pg_hba.conf file:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     trust
host    all             all             127.0.0.1/32            md5
host    all             all             172.16.0.1/16           md5

The comments in the file describe these fields in detail. We usually need to add a host (IP) entry. ADDRESS is an ordinary CIDR network segment; for METHOD, md5 is recommended, meaning the password is transmitted MD5-hashed.

Server configuration

The server itself is configured in postgresql.conf. After modifying the configuration, restart the database with the pg_ctl restart -D dir_db command;

We can also modify settings after logging in to the database: use SELECT * FROM pg_settings WHERE name = 'config'; to look up the current value of a setting, then change it with an UPDATE statement. However, settings changed this way, such as the memory allocation policy, are only valid for the current session; for a global effect, modify the configuration file and restart the server.
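For example, a minimal session-level sketch using work_mem (the UPDATE on pg_settings is equivalent to SET and lasts only for the current session):

SELECT name, setting, unit FROM pg_settings WHERE name = 'work_mem';   -- inspect the current value
UPDATE pg_settings SET setting = '64MB' WHERE name = 'work_mem';       -- session-local, same as SET work_mem = '64MB'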

We can verify configuration changes and optimize SQL statements from the client: use \timing on to enable query timing, and use the EXPLAIN ANALYZE statement to analyze query efficiency (see the example after the list below). Two practically useful configuration parameters:

  • shared_buffers: the amount of memory for the shared buffer cache. It should be large enough to hold commonly accessed data and reduce physical I/O, but not so large that it triggers system memory swapping; around 20% of system memory is a common setting.
  • work_mem: the working memory of a single connection. If it is too small for the amount of data a query handles, intermediate results spill to disk, causing heavy system I/O and a sharp drop in query speed. If the buffers section of your EXPLAIN output shows a large number of reads, working memory is insufficient and this parameter should be raised. But it cannot be too large either: make sure work_mem * max_connections + shared_buffers + OS overhead < RAM, otherwise the system may swap as well.
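For instance, a hypothetical timing check (the poi table and name column are placeholders), run in psql after \timing on:

EXPLAIN ANALYZE SELECT * FROM poi WHERE name LIKE '国家%';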

With this, PgSQL can be used as a normal relational database.

Word segmentation

Full-text indexing relies on PgSQL's GIN index. PgSQL has built-in word segmentation for English, Spanish and so on, but Chinese segmentation requires the open-source plug-in zhparser;

SCWS

To use zhparser, we first need to install the SCWS library. SCWS is short for Simple Chinese Word Segmentation; its GitHub project is hightman/scws. Download it and install it directly.

After installation, you can test word segmentation on the command line with the scws command. Its main parameters are:

  • -c utf8: specify the character set
  • -d dict: specify the dictionary, which can be in xdb or txt format
  • -M: the compound segmentation level, 1~15, a bitwise OR of 1|2|4|8 meaning short words | two-character duality | main characters | all characters. Compound segmentation is off by default; this parameter helps tune toward the segmentation you want.

zhparser

1. Download the zhparser source code: git clone https://github.com/amutu/zhparser;

2. Before installation, you need to configure the environment variables: export PATH=$PATH:/path/to/pgsql;

3. Compile and install zhparser: make && make install;

4. Log in to PgSQL and run CREATE EXTENSION zhparser; to enable the plug-in;

5. Add a word segmentation configuration:

CREATE TEXT SEARCH CONFIGURATION parser_name (PARSER = zhparser);  -- add the configuration
ALTER TEXT SEARCH CONFIGURATION parser_name ADD MAPPING FOR n,v,a,i,e,l,j WITH simple;  -- set segmentation rules (n = noun, v = verb, etc.; see the documentation for details)

6. Add a GIN index on a column's segmentation result: create index idx_name on table using gin(to_tsvector('parser_name', field)); (a consolidated sketch follows this list);

7. Use the scws command introduced in the previous section to test segmentation configurations on the command line. If, say, the results look best at compound level 7 (1 + 2 + 4), add the configuration:

zhparser.multi_short = true    # short word compound: 1
zhparser.multi_duality = true  # two-character duality compound: 2
zhparser.multi_zmain = true    # main single-character compound: 4
zhparser.multi_zall = false    # all single-character compound: 8
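Putting steps 4-6 together, a minimal end-to-end sketch (the configuration name zh_cfg and the table poi are hypothetical stand-ins for the parser_name, table and field placeholders above):

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zh_cfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zh_cfg ADD MAPPING FOR n,v,a,i,e,l,j WITH simple;
CREATE TABLE poi (id serial PRIMARY KEY, name varchar(128));
CREATE INDEX idx_poi_name_gin ON poi USING gin(to_tsvector('zh_cfg', name));
SELECT to_tsvector('zh_cfg', '南京市长江大桥');  -- quick check of the segmentation result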

SQL

In queries, the simplest form is SELECT * FROM table WHERE to_tsvector('parser_name', field) @@ 'word', which returns the rows whose segmented field contains the word word;

The to_tsquery() function parses a sentence into a combination of word vectors; for example, parsing “国家大剧院” (National Grand Theater) returns something like '国家' & '大剧院' & '剧院'. Of course, we can also splice together the vectors we need with the & and | operators. For long sentences, query with SELECT * FROM table WHERE to_tsvector('parser_name', field) @@ to_tsquery('parser_name','words');

Sometimes, as with MySQL's SQL_CALC_FOUND_ROWS, we want the total number of matches returned along with the results. We can use SELECT COUNT(*) OVER() AS score FROM table WHERE ...; PgSQL adds a score field to every row holding the total number of matched rows;
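Against the hypothetical poi table and zh_cfg configuration sketched earlier, these queries might look like:

SELECT * FROM poi
 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', '国家大剧院');

SELECT *, COUNT(*) OVER() AS score FROM poi
 WHERE to_tsvector('zh_cfg', name) @@ to_tsquery('zh_cfg', '剧院');  -- score = total matches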

At this point, ordinary full-text search requirements have been implemented.

Optimization

Next, we optimize segmentation quality and efficiency:

Store segmentation results

We can use a dedicated column to store segmentation vectors and create the index on that column, to make better use of the segmentation index:

ALTER TABLE table ADD COLUMN tsv_column tsvector;  -- add a segmentation-vector column
UPDATE table SET tsv_column = to_tsvector('parser_name', coalesce(field, ''));  -- populate the new column with the field's segmentation vector
CREATE INDEX idx_gin_zhcn ON table USING GIN(tsv_column);  -- index the new column
CREATE TRIGGER trigger_name BEFORE INSERT OR UPDATE ON table
  FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv_column, 'parser_name', field);  -- keep the vector up to date

In this way, when querying again, you can directly use SELECT * FROM table WHERE tsv_column @@ 'keyword'.

Note that inserting data into the table may now fail with an error asking you to specify the schema of parser_name. In that case, use the \dF command to list all text search configurations:

               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | english    | configuration for english language
 public     | myparser   |

Note the Schema column: when creating the trigger, the configuration name must be schema-qualified; for the myparser above, you need to use public.myparser.
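So the trigger from the block above becomes (a sketch; the schema-qualified name must match your \dF output):

CREATE TRIGGER trigger_name BEFORE INSERT OR UPDATE ON table
  FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv_column, 'public.parser_name', field);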

Add a custom dictionary

We can download an xdb-format dictionary online to replace the default one. PgSQL can only read dictionaries from the share/tsearch_data/ directory, where the default dictionary is dict.utf8.xdb. To use a custom dictionary, place it in that directory and configure the zhparser.extra_dicts parameter;
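For example (mydict.xdb is a hypothetical file placed in share/tsearch_data/):

zhparser.extra_dicts = 'mydict.xdb'  # hypothetical custom dictionary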

But what should we do when we only have a txt word list and want it to be the default dictionary? After generating an xdb with SCWS's scws-gen-dict tool (or a script found online) and putting it into the dictionary folder, zhparser reported a failure reading the dictionary file. After many experiments, I arrived at the following method for building a dictionary file:

1. Prepare the dictionary source file: each line has the format word TF IDF part-of-speech. The word is required; TF (term frequency), IDF (inverse document frequency) and the part of speech are optional. Unless you are sure your dictionary data is correct and matches what SCWS expects, it is better to leave them blank and let SCWS decide (see the sample after this list);

2. Set zhparser.extra_dicts = "" and set zhparser.dict_in_memory = true;

3. Enter PgSQL on the command line and execute a segmentation statement such as select to_tsquery('parser', 'any word');. Segmentation will be extremely slow, so be patient (and make sure only this one segmentation statement is running at the time);

4. Once segmentation succeeds, find the generated xdb file in the /tmp/ directory and use it to replace share/tsearch_data/dict.utf8.xdb;

5. Remove the extra_dicts and dict_in_memory settings you just added and restart the server.
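A hypothetical dictionary source file for step 1 (only the word column is required):

国家大剧院 10.0 8.0 n
深圳湾公园
南京市长江大桥 5.0 7.5 n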

Extending

Since the queries are POI names, which are generally short and often carry little semantics, and since users tend to type just the first few characters of a POI name while scws segmentation cannot be 100% accurate, I added a prefix query on the name to improve query accuracy, that is, a LIKE 'keyword%' query backed by a B-tree index.

Note that when creating this index, the operator class must match the field type, or the index may not take effect. For example, creating the index on a varchar field requires the statement CREATE INDEX idx_name ON table(COLUMN varchar_pattern_ops), where varchar_pattern_ops is the operator class.
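Against the hypothetical poi table from earlier:

CREATE INDEX idx_poi_name_prefix ON poi(name varchar_pattern_ops);
SELECT * FROM poi WHERE name LIKE '国家%';  -- served by the B-tree prefix index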

With that, a decent full-text search system is complete.

Summary

The simple data migration is not the end; much remains to be done afterwards, such as data synchronization across the whole system, query-efficiency optimization, and query-feature optimization (adding pinyin search, fuzzy search), etc. Query efficiency in particular: I don't know whether my configuration is at fault, but it comes nowhere near millisecond responses at the hundred-million-row scale. At 10 million rows, queries returning large result sets slowed down markedly (200 ms), so I had to split the table. For now, queries at the million-row scale finish within 20 ms, and there is still a long way to go on optimization.

This time, however, I gained a deeper appreciation of a technology's “ecosystem”. In this respect PgSQL really is far behind MySQL: no matter how weird a MySQL problem is, you can quickly find an answer online, whereas with PgSQL even entry-level questions often turn up only a few poorly matching answers. Although evangelists like Alibaba's “De Ge” are working hard to spread the word, user numbers are the foundation. Still, as PgSQL keeps improving, more and more people will use it; consider this article my bit of publicity for PgSQL, haha~ I hope it helps future users.
