PostgreSQL Tutorial (8): Detailed Index Explanation

1. Index type:

PostgreSQL provides a variety of index types: B-Tree, Hash, GiST, and GIN. Since they use different algorithms, each index type has its appropriate query type. By default, the CREATE INDEX command will create a B-Tree index.

1. B-Tree:

Copy the codeThe code is as follows:

    CREATE TABLE test1 (

        id integer,

        content varchar

    );

    CREATE INDEX test1_id_index ON test1 (id);

B-Tree index is mainly used for equal to and range queries, especially when the index column contains operators " <, <=, =, >= and >" as query conditions, PostgreSQL's query planner will consider using B-Tree index. PostgreSQL can also use B-Tree indexing in queries using BETWEEN, IN, IS NULL, and IS NOT NULL. However, for queries based on pattern matching operators, such as LIKE, ILIKE, ~ and ~*, the index will only take effect if there is a constant in the pattern and the constant is at the beginning of the pattern string, such as col LIKE 'foo%' or col ~ '^foo', otherwise a full table scan will be performed, such as: col LIKE '%bar'.

2. Hash：

Copy the codeThe code is as follows:

CREATE INDEX name ON table USING hash (column);

Hash index can only handle simple equal comparisons. When index columns are compared with equal operators, the query planner considers using hash indexes.
It should be noted here that the performance of PostgreSQL hash index is not stronger than that of B-Tree index, but the size and construction time of hash index are worse. Also, since the hash index operation currently does not record WAL logs, we will have to rebuild the hash index with REINDEX once a database crash occurs.

    3. GiST：
GiST index is not a separate index type, but a schema on which many different index strategies can be implemented. This allows GiST indexes to use specific operator types according to different index policies.

    4. GIN：
A GIN index is an inverted index that can handle values (such as arrays) containing multiple keys. Similar to GiST, GIN also supports user-defined indexing policies, so that GIN indexes can use specific operator types according to different indexing policies. As an example, the PostgreSQL standard release includes GIN operator types for one-dimensional arrays, such as: <@, @>, =, &&, etc.

2. Composite index:

Indexes in PostgreSQL can be defined on multiple fields in a data table, such as:

Copy the codeThe code is as follows:

    CREATE TABLE test2 (

        major int,

        minor int,

        name varchar

    }

    CREATE INDEX test2_mm_idx ON test2 (major, minor);

In the current version, only B-tree, GiST, and GIN support composite indexes, with up to 32 fields available.
1. Composite index of B-Tree type:
In a composite index of type B-Tree, any subset of the index field can be used for querying conditions, but the highest efficiency can be achieved only if the first index field (leftmost) in the composite index is included.

2. Composite index of GiST type:
In a compound index of GiST type, it is only possible to decide how much index data the query will scan when the first index field is included in the query condition, while the conditions on other index fields only limit the entries returned by the index. If most of the data on the first index field has the same key value, then applying the GiST index will be inefficient at this time.

3. Composite index of GIN type:
Unlike B-Tree and GiST indexes, GIN composite indexes are not affected by which subsets of index fields are used in the query conditions, and no matter which combination, they will get the same efficiency.

Use composite indexes with caution. In most cases, an index on a single field is sufficient and also saves time and space. Unless the table usage pattern is very fixed, indexes of more than three fields are of little use.

3. Combining multiple indexes:

PostgreSQL can combine multiple indexes during query (including multiple uses of the same index) to handle situations where a single index scan cannot be implemented. At the same time, the system can also form the conditions of AND and OR between multiple index scans. For example, a query similar to WHERE x = 42 OR x = 47 OR x = 53 OR x = 99 can be broken down into four independent scans based on x-field indexes, each scan uses a query clause, and then OR these scan results together and generate the final result. Another example is that if we have independent indexes on x and y, a query similar to WHERE x = 5 AND y = 6 will scan based on the indexes of these two fields, and then AND the results of each scan are AND and the final result row is generated.

In order to combine multiple indexes, the system scans each required index and then organizes a BITMAP in memory, which will give the physical location of the data scanned from the index in the data table. Then, according to the needs of the query, these bitmaps are AND or OR operations and the final BITMAP is obtained. Finally, retrieve the data table and return the data rows. The data rows of the table are accessed in physical order, because this is the layout of the bitmap, which means that any original index sort will disappear. If there is an ORDER BY clause in the query, there will be an additional sorting step. For this reason, and each additional index scan adds extra time, so the planner sometimes chooses to use simple index scans, even if multiple indexes are available.

4. Unique index:

Currently, only B-Tree indexes can be declared as unique indexes.

Copy the codeThe code is as follows:

CREATE UNIQUE INDEX name ON table (column [, ...]);

If the index is declared as a unique index, multiple rows with the same index value are not allowed. We believe that NULL values are not equal to each other.

5. Expression index:

Expression index is mainly used in the query condition when there are results of functions or expressions based on a certain field and other values, such as:

Copy the codeThe code is as follows:

SELECT * FROM test1 WHERE lower(col1) = 'value';

At this time, if we just create an index on the col1 field, then the query will not use the index when executing, but will directly scan the full table. If the table has a large amount of data, it will take a long time to execute the query. The solution to this problem is very simple, and establish an expression index based on the col1 field on the test1 table, such as:
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
If we declare the index as UNIQUE, it will prohibit the creation of data rows with col1 values that are just case-different data rows, and data rows with exactly the same col1 values. Therefore, indexes on expressions can be used to force constraints that cannot be defined as simple unique constraints. Now let's look at another example of applying expression indexing.

Copy the codeThe code is as follows:

SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';

Like the example above, although we may create independent indexes for first_name and last_name respectively, or composite indexes based on these two fields, these indexes will not be used when executing the query statement. The only indexes that the query can use are the expression indexes we created below.

Copy the codeThe code is as follows:

CREATE INDEX people_names ON people ((first_name || ' ' || last_name));

The syntax of the CREATE INDEX command usually requires parentheses written around the index expression, as we show in the second example. If the expression is just a function call, it can be omitted, as we showed in the first example.

From the perspective of index maintenance, index expressions are relatively inefficient, because when inserting data or updating data, the result of the expression must be calculated for the row and stored directly in the index. However, when querying, PostgreSQL will treat them as WHERE idxcol = 'constant', so the search speed is equivalent to a query based on a simple index. Generally speaking, we should just use expression indexing in scenarios where the search speed is more important than the insertion and update speed.

6. Partial index:

A partial index is an index built on a subset of a table, and the subset is defined by a conditional expression (called a predicate of partial index). This index contains only those rows in the table that satisfy this predicate.
Since indexes are not required in all cases, some indexes will improve the efficiency of data insertion and data updates. However, because partial indexes are smaller than ordinary indexes, the query efficiency of the index parts can be improved better. See the following three examples:
1. The index field and the predicate condition field are consistent:

Copy the codeThe code is as follows:

    CREATE INDEX access_log_client_ip_ix ON access_log(client_ip)

        WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');

The following query will use this part of the index:

Copy the codeThe code is as follows:

SELECT * FROM access_log WHERE url = '/' AND client_ip = inet '212.78.10.32';

The following query will not use this part of the index:
A query that cannot use this index can be:

Copy the codeThe code is as follows:

SELECT * FROM access_log WHERE client_ip = inet '192.168.100.23';

2. The index field and the predicate condition field are inconsistent:
PostgreSQL supports partial indexes with arbitrary predicates, and the only constraint is that the fields of the predicate must also come from the same data table. Note that if you want your query to use partial indexes, then you must have the conditional part of the query statement exactly match the predicate of the partial index. To be precise, this partial index can only be used for the query if PostgreSQL can recognize that the WHERE condition of the query mathematically covers the predicate of the index.

Copy the codeThe code is as follows:

    CREATE INDEX orders_unbilled_index ON orders(order_nr) WHERE billed is not true;

 

The following query will definitely use this part of the index:

Copy the codeThe code is as follows:

SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;

So what about the following query?

Copy the codeThe code is as follows:

SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;

This query will not be as efficient as the query above. After all, the query condition statement does not use the index field, but the query condition "billed is not true" exactly matches the predicate of partial indexes, so PostgreSQL will scan the entire index. This way, the query can only be more effective if the index data is relatively small.

The following query will not use partial indexes.

Copy the codeThe code is as follows:

SELECT * FROM orders WHERE order_nr = 3501;

3. Uniqueness constraints for subsets of data tables:

Copy the codeThe code is as follows:

    CREATE TABLE tests (

        subject text,

        target text,

        success boolean,

        ...

    );

    CREATE UNIQUE INDEX tests_success_constraint ON tests(subject, target) WHERE success;

This part of the index will only uniquely constrain data with the success field value true. In actual applications, if there is less successful data and more unsuccessful data, the implementation method will be very efficient.

7. Check the use of indexes:

See the following four suggestions:
1. Always run ANALYZE first.
This command will collect statistics on the numerical distribution status in the table. This information is required when estimating the number of rows returned by a query, and the planner needs this number of rows to assign a real overhead value to each possible query plan. If any real statistics are lacking, some default values will be used, which is definitely inaccurate. Therefore, if you check the usage of an index before ANALYZE has been run, it will be a failed check.
2. Use real data to do experiments.
If you fill the data table with test data, the index of the table will only evaluate how the index should be used based on the test data, rather than using it in this way for all data. For example, if you select 1000 rows from 100000 rows, the planner may consider using indexes. Then if you select 1 row from 100 rows, it is difficult to say that you will also use indexes. Because 100 rows of data are likely to be stored in a disk page, however, no query planning is more efficient than accessing a disk page by sequentially. At the same time, when simulating test data, it is also important to note that if these data are very similar data, completely random data, or data inserted in sorted order, it will deviate the statistical information from the characteristics that the actual data should have.
3. If the index is not used, then forcing its use in the test may be of some value. There are some runtime parameters that can turn off a variety of query plans.
4. Forced use of index usage will lead to two possibilities: one is that the system selection is correct, and using indexes is actually not appropriate, and the other is that the overhead calculation of query plans cannot reflect the actual situation. In this way, you should time queries that use and do not use indexes. At this time, the EXPLAIN ANALYZE command is very useful.