Detailed explanation of database SQL query performance optimization

Preface

For a small system with only a few thousand rows or even tens of thousands of rows of data, database query optimization is not very effective, but for large application systems, data is often over millions. You need to understand the process and optimization algorithm of DBMS for query statements, and better use its optimization algorithm to improve the performance of the system.

Query execution process

Execution of a query statement requires several key steps such as query analysis, query inspection, query optimization, query execution, etc., and the specific description is as follows:

Query analysis: Scan, lexical analysis and grammatical analysis of query statements, identify language symbols from query statements, conduct grammatical checks and grammatical analysis

Query check: Perform semantic checks on legitimate query statements based on the data dictionary. Including checking the user's access permissions and checking the integrity constraint definition; after passing the inspection, the SQL query statement is converted into equivalent relational algebraic expressions, and generally using a query tree (grammatical analysis tree) to represent relational algebraic expressions

Query optimization: The database management system will automatically select an efficient query processing strategy. The so-called query optimization is an efficient query processing strategy that our DBMS can choose from, providing conditions, such as: establishing an index of which type.

Query execution: The code generator generates code to execute the query plan and executes the corresponding instructions. Query optimization is to select the optimal algorithm for optimization based on the access path established by DBMS based on physical design. DBMS query optimization includes algebraic optimization and physical optimization.

Common algorithms for single table selection operations

Full table scan: Scan the basic tables of the query in sequence, check whether each tuple meets the selection conditions one by one, and output the tuples that meet the conditions as the result. This algorithm is suitable for small tables, not large tables
Index Scan: First find the main code or tuple pointer that meets the conditions through the index, and then directly find the tuple in the basic table of the query through the tuple pointer. The index uses B+ tree or hash hash. When there are many tuples, the speed is faster than scanning the entire table.

Inspiration on selecting operations

For small relationships, use full table sequential scans, even if there is an index on the selected column
For large relationships, the selection condition is a query with the main code = value, and the result is only one tuple, and the main code index is selected;
For query with large relationships, non-main attribute = value, if there is an index on the selected column, the DBMS determines the access method based on the number of tuples in the query result. If the tuple proportion is small (<10%), use the index scanning method, otherwise use the full table order scan
The query condition contains the combined selection conditions for AND connection. If the combined index (such as the condition is the student number and the class joint index), the combined index scanning method is preferred.
The query conditions contain the combination selection conditions for AND connections. If there are general indexes on some properties (such as only student number indexes), you can use the index scanning method. If there is no index, you can use full table scanning.
The query conditions include or, between,! =, <>, in, whether it is empty (null), the index will not be used.

in conclusion: Using full table scan or index scan is selected by the DBMS based on the number of records in the table and the number of data records found.

Generally speaking, the number of records selected is greater than 20% of the total number of records in the entire table or there are very few records in the table, and full table scanning is used. Therefore, if there are few records in the table or many records are selected for each query, the index created cannot improve performance.

Connection operation implementation algorithm

Taking SELECT * FROM student, sc WHERE = as an example, assuming that 1000 records in the student table and 10000 records in the sc table

Nested loops: For each tuple (s) of the outer loop (s), search for each tuple (s) in the inner loop (sc), check whether these two tuples are equal in the connection attribute (sno). If the connection condition is met, it is connected as the result and output until the tuple in the outer loop table is processed. The number of executions is 1000*10000=10 7 ^7 7 times
Sort and merge: First sort the sno in the student and sc tables, take the first sno in the student table, and scan the tuples with the same sno in the sc table in sequence. When scanning the first SC tuple with different Snos, return to the Student table to scan its next tuple, and then scan the tuples with the same sno in the sc table, and connect them until the student table scan is completed. The number of executions is 10000=10 4 ^4 4 times, but the time it takes to sort is added
Index connection: If there is no index of attribute sno on the sc table, create an sno index; for each tuple in the student, the sno value searches the corresponding SC tuple through the SC index; connect these SC tuples and Student tuples; until the tuples in the Student table are processed. The number of connections is 1000= 10 3 ^3 3 times, but it requires the search time of the index
Hash connection: Use the connection attribute as hash code, and hash the tuples in R and S into the same hash file using the same hash function. The first step is to process one of the smaller students in the two tables and distribute its tuples into the hash bucket according to the hash function. The second step is to process the other table (sc) once, hash the tuples of S into the appropriate hash bucket, and connect the tuples with all the tuples in the bucket that come from R and match it.

Inspiration on connection operations

If both tables have been sorted by join properties, select the sort-merge method
If a table has an index on the join property, use the index join method
If none of the above two rules apply, one of the tables is small, use the Hashjoin method
You can choose a nested loop method and select a smaller table. To be precise, it is a table with less blocks (b) occupied as an outer table (the table of the outer loop)

in conclusion: When the number of ancestors in the table is relatively large, it is recommended to establish an index, which will greatly improve the search efficiency. However, it is also cost to perform index searches that require indirect acquisition of data through index pointers and manage indexes. When there are fewer ancestors, DBMS will not choose sorting merge and index connection. DBMS may choose a Hash connection when there are fewer table ancestors, more one, and no indexes

Relational algebra optimization

Centralized database: Total cost = I/O cost + CPU cost + memory cost, mainly considering I/O cost;

Distributed database: Total cost = I/O cost + CPU cost + memory cost + communication cost, mainly considering I/O cost and communication cost

[Example]: Suppose there are 1,000 student records and 10,000 course selection records in the student-curriculum database, of which 50 course selection records are taken for elective course No. 2, and the cost of this SQL expression is calculated using different relational algebras. SELECT FROM Student, SC WHERE = AND = ‘2’; Suppose a block can hold 10 Student tuples or 100 SC tuples, and 5 Student tuples and 1 SC tuple can be stored in memory.

πsname(σ=∧=‘2’ (student×sc)) The two relationships of student and sc are used as Cartesian product. The spliced priest is moved out of memory and written to the intermediate file. After the splicing is completed, it will be read in and performed selection operations. The spliced priest is 10 yuan, and 20 yuan is read per second. The time cost is ≈105+2×5×10 4 ^4 4≈10 5 ^5 5s=27.8 hours
πsname(σ='2' (student ⋈ sc)) means that the two relationships are connected naturally first, and then selected and projected. In this way, after the students and the course match, there are up to 10,000 tuples, that is, the number of SC tuples. The cost of writing and reading intermediate files is greatly reduced. Similarly, 10 Yuan Ancestors per block, 20 yuan per second will require time cost ≈105+50+50≈205s
πsname(Student ⋈ σ=’2’(sc)) First select the sc table. After the selection is completed, there are up to 50 pieces of data that meet the conditions. The total execution time is ≈5+5≈10s
In the third case, if the sc table cno has an index, the number of index blocks read is smaller, about 3·4 blocks, and if the Sno on the student also has an index, there is no need to read all student originals, the time is faster, and it can be read in about 1 second.

Algebraic optimization inspired, minimize I/O operations and reduce the number of times you write and read data blocks, which can improve performance, which has the following inspiration for us.

The selection operation should be done first as much as possible. This is the most important and basic one in the optimization strategy
Perform projection and selection operations simultaneously to reduce the number of reading and writing intermediate files
Combining projection with binocular operations before or after it
Reduce the number of scan relationships
Some choices are combined with the Cartesian product to be performed before it to become a connection operation
Find common subexpressions

Summarize

Query optimization technology is a key technology in each type of database management system and is automatically completed by DBMS. However, do not put all the optimization tasks on RDBMS. You should find out the optimization rules of RDBMS to write SQL statements suitable for RDBMS automatic optimization. For queries that cannot be optimized by RDBMS, you need to re-plan and write the query statement and manually adjust it to optimize the query performance.

This is the end of this article about the detailed explanation of database SQL query performance optimization. For more related database query optimization content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!