Detailed explanation of StarRocks indexes (latest compilation)

1. Primary Key Index

Principle: The primary key index serves tables that declare a primary key. StarRocks stores the table's data sorted by the sort key, which defaults to the primary key columns, and maintains an index that maps each primary key value to the location of its row. This not only enforces the uniqueness constraint, but also lets primary-key-based queries locate records quickly instead of scanning the table.

Case: Suppose there is a user table user_data whose primary key is defined as user_id. StarRocks automatically creates the index for the primary key column and stores the rows in primary key order. The primary key index works efficiently when executing a point query like the one below:

-- Specify the primary key when creating a table
CREATE TABLE user_data (
    user_id INT NOT NULL,
    name STRING,
    gender VARCHAR(10), -- StarRocks has no ENUM type; holds 'Male' or 'Female'
    registration_date DATE
) ENGINE=OLAP
PRIMARY KEY (user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 100;

SELECT * FROM user_data WHERE user_id = 123;
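
To illustrate the uniqueness guarantee, here is a minimal sketch. It assumes user_data has been created as a Primary Key table keyed on user_id; the inserted values are made up for illustration. In a Primary Key table, loading a row whose key already exists replaces the existing row rather than adding a duplicate.

```sql
-- Illustrative values only; assumes user_data is a Primary Key table keyed on user_id.
INSERT INTO user_data (user_id, name, gender, registration_date)
VALUES (123, 'Alice', 'Female', '2024-01-01');

-- Same primary key again: the existing row is overwritten, not duplicated.
INSERT INTO user_data (user_id, name, gender, registration_date)
VALUES (123, 'Alice Wang', 'Female', '2024-01-01');

-- Point lookup by primary key; at most one row can match.
SELECT * FROM user_data WHERE user_id = 123;
```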

2. Prefix Index (ShortKey Index)

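Principle: StarRocks stores a table's data sorted by its sort key and automatically builds a prefix index (also called the ShortKey index) on the leading columns of that key. The index is sparse and keeps only a short prefix of the sort key rather than the whole key, which reduces index space, and it speeds up queries that filter on the leading sort key columns. For a long string column, only its leading bytes may fit into the prefix, so the index is most useful for matching values from the beginning.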

Case: If the table's sort key is (country_code, user_id) and country_code is a low-cardinality column (such as a country code), StarRocks automatically builds the prefix index. The index is used when a query filters on country_code, for example:

SELECT * FROM user_table WHERE country_code = 'CN';
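
As a rough contrast, the following sketch assumes the sort key (country_code, user_id) described above and an integer user_id (the type is an assumption). Because prefix index entries are ordered by the leading column first, only filters that constrain country_code can narrow the scan through the prefix index.

```sql
-- Assumes user_table has the sort key (country_code, user_id).
-- Filters on the leading sort key column: the prefix index can narrow the scan.
SELECT * FROM user_table WHERE country_code = 'CN' AND user_id = 123;

-- Filters only on the second sort key column: the prefix index generally
-- cannot be used, because entries are ordered by country_code first.
SELECT * FROM user_table WHERE user_id = 123;
```

If queries mostly filter on user_id alone, placing user_id first in the sort key is the usual remedy.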

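-- Case: suppose there is a phone number column phone_number that is often queried by area code.
-- StarRocks does not accept the MySQL-style INDEX idx (phone_number(7)) clause; the prefix
-- index is built automatically from the leading sort key columns, so phone_number goes first.
CREATE TABLE users (
    phone_number VARCHAR(20),
    ...
) DUPLICATE KEY(phone_number)
DISTRIBUTED BY HASH(phone_number);
-- Query by area code on the leading sort key column
SELECT * FROM users WHERE phone_number LIKE '010%';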

3. Bitmap index

Principle: A bitmap index is especially suitable for columns with low cardinality (few distinct values), such as gender, region, and other categorical attributes. It maps each distinct value to a bitmap in which each bit indicates whether the corresponding row contains that value. When a query combines several conditions, StarRocks only needs to apply logical operations (AND for intersection, OR for union) to the bitmaps, which makes such set-style filters efficient.

Case: Suppose there is a gender column, gender, whose value has only two states (male/female). To quickly count the number of male and female users, you can create a bitmap index on the gender column. The index and the query are as follows:

CREATE INDEX idx_gender ON user_data (gender) USING BITMAP;

SELECT COUNT(*) FROM user_data WHERE gender = 'Female';
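
The bitmap intersection described in the principle can be sketched with two indexed columns. The table user_profile, its columns, and the index names are hypothetical and exist only for this illustration; the indexes are declared inside CREATE TABLE so that no separate schema change job is needed.

```sql
-- Hypothetical table for illustration; all names here are assumptions.
CREATE TABLE user_profile (
    user_id BIGINT NOT NULL,
    gender VARCHAR(10),
    region VARCHAR(20),
    -- One bitmap index per low-cardinality column.
    INDEX idx_profile_gender (gender) USING BITMAP,
    INDEX idx_profile_region (region) USING BITMAP
) ENGINE=OLAP
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

-- Both predicates can be answered from bitmaps; their intersection (AND)
-- identifies the candidate rows before the data itself is read.
SELECT COUNT(*) FROM user_profile WHERE gender = 'Female' AND region = 'Beijing';

-- List the indexes defined on the table.
SHOW INDEX FROM user_profile;
```

Whether the engine actually uses a bitmap index for a given query is decided at execution time, so the comments describe what the index makes possible rather than a guaranteed plan.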

4. Bloom filter index

Case: Using a Bloom filter index on a high-cardinality column (such as an order ID) helps quickly rule out data blocks that certainly do not contain the value being searched for, reducing unnecessary data reads. For example:
Suppose we have a table named users containing two fields, id and name, and we want to create a Bloom filter on the id field:

CREATE TABLE users (
    id BIGINT COMMENT 'User ID',
    name STRING COMMENT 'username'
) ENGINE=OLAP
DUPLICATE KEY(id)
COMMENT 'User Table'
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
    "bloom_filter_columns" = "id"
);

-- Insert some data into the users table:
INSERT INTO users (id, name) VALUES (1, 'Alice');
INSERT INTO users (id, name) VALUES (2, 'Bob');
INSERT INTO users (id, name) VALUES (3, 'Charlie');

SELECT * FROM users WHERE id = 4;

Because a Bloom filter is created on the id field, StarRocks can first check the Bloom filter to judge whether a record with id 4 may exist. If the Bloom filter determines that this id does not exist in a data block, StarRocks can skip that block without scanning it; if every block is ruled out, the query returns an empty result without reading any rows.

It should be noted that a Bloom filter can only rule values out. A negative answer is always correct, but a positive answer may be a false positive. Therefore, even if the Bloom filter reports that a certain id may exist, StarRocks still needs to scan the data to confirm whether the id really exists, so the query results themselves remain exact.
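
As a rough guide to which filters can benefit, the sketch below reuses the users table with id and name from this case. In StarRocks, Bloom filter indexes help equality and IN conditions; the ALTER TABLE statement shows one way to change the indexed columns afterwards. The extra id values are arbitrary.

```sql
-- Equality and IN filters on a Bloom filter column can skip data blocks.
SELECT * FROM users WHERE id = 4;
SELECT * FROM users WHERE id IN (4, 5, 6);

-- A range filter is not a membership test, so the Bloom filter does not help here.
SELECT * FROM users WHERE id > 4;

-- Change the set of Bloom filter columns on an existing table.
ALTER TABLE users SET ("bloom_filter_columns" = "id,name");

-- The table definition shows the current bloom_filter_columns property.
SHOW CREATE TABLE users;
```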

The false positive rate of a Bloom filter depends on its configuration, mainly the size of the bit array and the number of hash functions relative to the number of stored values. In practical applications, the Bloom filter should be configured according to the characteristics of the data and the query requirements to balance index size against query performance.
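
For reference, the standard textbook approximation of the false positive rate is given below. It describes Bloom filters in general and is not a statement about how StarRocks sizes its filters internally. Here m is the number of bits in the array, n the number of values inserted, and k the number of hash functions.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
```

For example, with 10 bits per value (m/n = 10) and k = 7, the formula gives p ≈ (1 - e^{-0.7})^7 ≈ 0.008, that is, roughly 0.8% of lookups for absent values would still require a scan.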

Case 2: Suppose we have a table named users with an email field, and we want to create a Bloom filter on this field:

CREATE TABLE users (
    id INT,
    email VARCHAR(255),
    name VARCHAR(255),
    age INT
) DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
    -- Bloom filter on email; StarRocks declares it as a table property,
    -- not with an INDEX ... USING BLOOM_FILTER clause
    "bloom_filter_columns" = "email"
);

INSERT INTO users (id, email, name, age) VALUES (1, 'user1@', 'User One', 30);
INSERT INTO users (id, email, name, age) VALUES (2, 'user2@', 'User Two', 25);
-- Insert more data...

-- Bloom filter indexes help equality and IN filters, not LIKE patterns
SELECT * FROM users WHERE email = 'user1@';
