
MySQL transaction deadlock: investigation and solution

Problem phenomenon

The Java service reported a large number of error logs (see the appendix for details). The errors fall mainly into the following categories:

  • Caused by: Read timed out
  • Caused by: No operations allowed after connection closed.
  • The last packet successfully received from the server was 10,003 milliseconds ago. The last packet sent successfully to the server was 10,003 milliseconds ago.
  • Cause: Deadlock found when trying to get lock; try restarting transaction
  • Cause: Lock wait timeout exceeded; try restarting transaction

Speculation 1 - Incorrect client reconnect configuration

During the first three days of investigation, most of the logs were Read timed out and No operations allowed after connection closed. The bigger problem was that after the errors appeared, the service hung (fake death) and could not be used until it was restarted.

The initial suspicion was a problem with the client connection pool or with the configuration used to obtain MySQL connections.

The connection pool was adjusted, adding timeout and reconnect parameters:

@Override
public DataSource getDataSource() {
    DruidDataSource dataSource = new DruidDataSource();
    // NOTE: the setter names below were lost from the original post; they are
    // reconstructed from the surviving values and typical Druid usage, so treat
    // them as a best-effort guess rather than the exact original code.
    dataSource.setUrl(config.getProperty("url"));
    dataSource.setUsername(config.getProperty("k1"));
    dataSource.setPassword(config.getProperty("k2"));
    dataSource.setMinIdle(30);
    dataSource.setInitialSize(5);
    dataSource.setMaxActive(60);
    dataSource.setConnectionErrorRetryAttempts(3);
    dataSource.setMaxWait(60000);
    dataSource.setBreakAfterAcquireFailure(false);
    dataSource.setTimeBetweenConnectErrorMillis(-1);
    dataSource.setValidationQuery("select 'x'");
    dataSource.setValidationQueryTimeout(30000);
    dataSource.setTestOnBorrow(false);
    dataSource.setTestOnReturn(false);
    dataSource.setTestWhileIdle(true);
    dataSource.setKeepAlive(true);
    dataSource.setMinEvictableIdleTimeMillis(300000);
    dataSource.setTimeBetweenEvictionRunsMillis(60000);
    dataSource.setRemoveAbandoned(true);
    dataSource.setRemoveAbandonedTimeout(10);
    dataSource.setNotFullTimeoutRetryCount(1000); // uncertain reconstruction
 
    Properties connectProp = new Properties();
    connectProp.setProperty("autoReconnect", "true");  // key reconstructed (reconnect parameter)
    connectProp.setProperty("connectTimeout", "5000"); // key reconstructed (timeout parameter)
    dataSource.setConnectProperties(connectProp);
 
    try {
        dataSource.init();
    } catch (SQLException e) {
        log.error(e.getMessage(), e); // 'log' assumed: the class's logger
    }
    return dataSource;
}

The MySQL connection URL was adjusted as follows, adding allowPublicKeyRetrieval:

...&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true";
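For reference, a complete JDBC URL carrying the reconnect and timeout parameters discussed above might look like the following; the host, port, database name, and the useSSL/timeout values here are placeholders, not taken from the source:

jdbc:mysql://127.0.0.1:3306/mydb?useSSL=false&autoReconnect=true&connectTimeout=5000&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true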

Result: the error was still reported; not resolved. However, the fake-death problem was solved: after an error occurs, the service remains usable as long as concurrency stays low (for example, 2 concurrent requests).

Speculation 2 - Client timeout is too short

Suspecting a MySQL performance problem (the SQL simply takes a long time to execute), increase the client-side timeouts and retry.

@Override
public DataSource getDataSource() {
    DruidDataSource dataSource = new DruidDataSource();
    // NOTE: as above, the setter names are reconstructed from the values; the
    // pool was enlarged and the client-side timeouts were raised.
    dataSource.setUrl(config.getProperty("url"));
    dataSource.setUsername(config.getProperty("k1"));
    dataSource.setPassword(config.getProperty("k2"));
    dataSource.setMinIdle(60);
    dataSource.setInitialSize(5);
    dataSource.setMaxActive(105);
    dataSource.setConnectionErrorRetryAttempts(30);
    dataSource.setMaxWait(60000);
    dataSource.setBreakAfterAcquireFailure(false);
    dataSource.setTimeBetweenConnectErrorMillis(-1);
    dataSource.setValidationQuery("select 'x'");
    dataSource.setValidationQueryTimeout(60000);
    dataSource.setTestOnBorrow(false);
    dataSource.setTestOnReturn(false);
    dataSource.setTestWhileIdle(true);
    dataSource.setKeepAlive(true);
    dataSource.setMinEvictableIdleTimeMillis(300000);
    dataSource.setTimeBetweenEvictionRunsMillis(60000);
    dataSource.setRemoveAbandoned(true);
    dataSource.setRemoveAbandonedTimeout(10);
    dataSource.setNotFullTimeoutRetryCount(1000); // uncertain reconstruction
    // The three raised timeouts (exact setter names uncertain):
    dataSource.setConnectTimeout(150000);
    dataSource.setSocketTimeout(150000);
    dataSource.setQueryTimeout(150000);
 
    Properties connectProp = new Properties();
    connectProp.setProperty("autoReconnect", "true");  // key reconstructed
    connectProp.setProperty("connectTimeout", "5000"); // key reconstructed
    dataSource.setConnectProperties(connectProp);
 
    try {
        dataSource.init();
    } catch (SQLException e) {
        log.error(e.getMessage(), e); // 'log' assumed: the class's logger
    }
    return dataSource;
}

MyBatis configuration:

<!-- Configuration -->
<settings>
    <setting name="mapUnderscoreToCamelCase" value="true"/>
    <setting name="useGeneratedKeys" value="true"/>
    <setting name="defaultStatementTimeout" value="60"/>
</settings>

Result: the error was still reported; not resolved. But with the longer timeouts, more operations succeeded.

Speculation 3 - MySQL version issue

The online environment runs MySQL 8; pull the same version locally for testing:

docker pull /vwf-base/mysql:8.0.31
 
docker run -it \
-p 33305:3306 \
-e MYSQL_ROOT_PASSWORD='123456' \
--volume /tmp/docker-cps/wf-base/tmp/mysqldata:/var/lib/mysql /vwf-base/mysql:8.0.31

Result: the error was still reported; not resolved. MySQL 5.7 gives the same error message as 8.0.

Speculation 4 - Client connection pool concurrency is too low

Increase Druid's maximum pool size to 105; MySQL's default maximum number of connections is 151 (show variables like '%max_connection%';).

While the concurrent load is running, keep executing show status like 'Threads%'; in MySQL to monitor the number of connected threads (Threads_connected).

Result: the error was still reported; not resolved. The observed concurrency never exceeded 60, meaning the pool size configured in Druid already fully covers the demand; the pool is not the bottleneck.

Speculation 5 - MySQL service performance is low

Run stress tests against MySQL.

First install sysbench via Homebrew (brew install sysbench).

Create a database named benchmark in MySQL and prepare the test data with prepare:

sysbench \
        --test='/usr/local/Cellar/sysbench/1.0.20_3/share/sysbench/tests/include/oltp_legacy/' \
        --oltp-tables-count=1 \
        --report-interval=10 \
        --oltp-table-size=1000000 \
        --mysql-user=root \
        --mysql-password=123456 \
        --mysql-table-engine=innodb \
        --rand-init=on  \
        --mysql-host=127.0.0.1 \
        --mysql-port=13336 \
        --mysql-db=benchmark \
        --time=300 \
        --max-requests=0 \
        --oltp_skip_trx=on \
        --oltp_auto_inc=off \
        --oltp_secondary=on \
        --threads=50 \
        prepare

After it finishes, change prepare to run, i.e., start the stress test on the local machine and collect MySQL performance statistics.

The benchmark results on host 243 are as follows:

  • transactions: 26872 (tps 89.45 per sec.)
  • queries: 483696 (qps 1610.09 per sec.)

The local MySQL process (not in Docker) gives the following:

  • transactions: 185025 (tps 616.36 per sec.)
  • queries: 3330450 (qps 11094.40 per sec.)

MySQL running in local Docker gives the following:

  • transactions: 18324 (tps 60.97 per sec.)
  • queries: 329832 (qps 1097.50 per sec.)

It can be seen that the non-Docker MySQL far outperforms both host 243 and the Docker instance, by more than 6x. Because its performance is so good, the bug does not reproduce there, whereas on the Docker-run MySQL, whose performance is mediocre, the bug reproduces.

Finally, the ops team provided a 48C + 128G machine running a standalone MySQL service, with the following performance:

48C + 128G MySQL service performance:

  • transactions: 86301 (tps 287.42 per sec.)
  • queries: 1553418 (qps 5173.49 per sec.)

This server performs almost 3x better than host 243. In the R&D environment, the Java service's database was migrated to this MySQL instance and the test was repeated.

Result: the error was still reported; not resolved. At this point it can be concluded that MySQL performance is not the problem.

Speculation 6 - Client code fails to close resources

The client, i.e., the Java service, currently uses MyBatis and runs its transactions through SqlSession. From the official documentation, SqlSession is not thread-safe; moreover, there may be places where a SqlSession is never closed, and an unclosed session hangs on to its connection, which can have serious consequences.

Two changes were made to address these points from the official documentation (see the sketch after this list):

  • Turn the class-level SqlSession field into a method-local variable
  • Open every SqlSession with try-with-resources
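A minimal sketch of these two changes, assuming a SqlSessionFactory field and a hypothetical EntityMapper; the names are illustrative, not taken from the original code:

public void deleteEntities(String prefix) {
    // SqlSession is not thread-safe, so it is created per call (method-local)
    // rather than stored in a field, and try-with-resources guarantees
    // session.close() runs even if an exception is thrown.
    try (SqlSession session = sqlSessionFactory.openSession()) {
        EntityMapper mapper = session.getMapper(EntityMapper.class);
        mapper.deleteByDirectoryPrefix(prefix); // hypothetical mapper method
        session.commit();
    }
}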

Finally, rerun the test.

Result: the error was still reported; not resolved.

Speculation 7 - Client code leads to long transactions

Judging from the Java service's error log, the recurring errors are caused by DELETE statements timing out.

On the other hand, after each error, MySQL's transaction views show the DELETE statement sitting in LOCK WAIT:

-- 1. Query lock waits
select * from `sys`.`innodb_lock_waits`;
 
-- 2. Query lock information
select * from `performance_schema`.data_locks;
 
-- 3. Query lock waiting information
select * from `performance_schema`.`data_lock_waits`;
 
-- 4. Query transaction information
select * from `information_schema`.innodb_trx;
 
-- 5. Query statement event history
select * from `performance_schema`.`events_statements_history`;
 
 
-- 6. View the statements currently holding locks
SELECT * FROM performance_schema.events_statements_history WHERE thread_id IN(
SELECT b.`THREAD_ID` FROM sys.`innodb_lock_waits` AS a , performance_schema.threads AS b
WHERE a.`blocking_pid` = b.`PROCESSLIST_ID`)
ORDER BY timer_start ASC;
 
 
-- 7. View the statements currently blocked by locks
SELECT * FROM performance_schema.events_statements_history WHERE thread_id IN(
SELECT b.`THREAD_ID` FROM sys.`innodb_lock_waits` AS a , performance_schema.threads AS b
WHERE a.waiting_pid = b.`PROCESSLIST_ID`)
ORDER BY timer_start ASC;
 
-- 8. View the most recent deadlock log (the value of the Status field)
SHOW ENGINE INNODB STATUS;

Finally, the last item, SHOW ENGINE INNODB STATUS;, yields the deadlock log.

Analysis of the deadlock log:

TRANSACTION 1883862 running statement: DELETE FROM `tb_entity` WHERE `directory` LIKE concat('cloud_data/133/task_k35846zp', '%')
 
TRANSACTION 1883862 holds record lock: [space id 2 page no 309 n bits 152; heap no 42 PHYSICAL RECORD]
 
TRANSACTION 1883862 waits for record lock: [space id 2 page no 303 n bits 152; heap no 72 PHYSICAL RECORD]
 
===
 
TRANSACTION 1883861 running statement: DELETE FROM `tb_entity` WHERE `directory` LIKE concat('local_data/134/sampling_output_uqlv65t2', '%')
 
TRANSACTION 1883861 holds record lock: [space id 2 page no 303 n bits 152; heap no 72 PHYSICAL RECORD (many more held, omitted...)]
 
TRANSACTION 1883861 waits for record lock: [space id 2 page no 309 n bits 152; heap no 42 PHYSICAL RECORD]

The deadlock log shows that the two transactions each want a lock held by the other, so a deadlock occurs.

In other words, even though these are two different SQL statements whose LIKE patterns match non-intersecting sets of records, they can still deadlock. This is because, for update-class operations with LIKE (UPDATE, DELETE), the MySQL transaction acquires record locks on all the records it scans rather than only the rows it changes; with concurrent transactions this easily produces deadlocks.
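A minimal repro sketch of this scenario, assuming a local MySQL instance with the tb_entity table from the deadlock log (the JDBC URL, credentials, and data are placeholders): two threads concurrently run the two DELETE ... LIKE statements in their own transactions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DeadlockRepro {
    static void deleteByPrefix(String prefix) {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:3306/test", "root", "123456")) {
            conn.setAutoCommit(false); // one explicit transaction per thread
            try (Statement st = conn.createStatement()) {
                // Locks every record it scans, not just the rows it deletes
                st.executeUpdate("DELETE FROM tb_entity WHERE directory LIKE '" + prefix + "%'");
            }
            conn.commit();
        } catch (Exception e) {
            // With enough matching rows, one thread often fails here with
            // "Deadlock found when trying to get lock; try restarting transaction"
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new Thread(() -> deleteByPrefix("cloud_data/133/")).start();
        new Thread(() -> deleteByPrefix("local_data/134/")).start();
    }
}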

Solution

Here the root cause is finally known: under heavy concurrency, the Java service's DELETE statements, because of the LIKE condition, take record locks on all scanned records; the concurrent transactions then compete for one another's locks and deadlock.

  • Solution 1. Rework the DELETE on the Java side: first query the ids to be deleted, then delete them in batches by id. Prefer select id ... limit n followed by delete ... where id in (ids), since fetching every id at once may return too many rows. This stops concurrent DELETE transactions from each acquiring the full set of record locks, which is what produces the deadlock. This solution fixes the root cause (see the sketch after this list).
  • Solution 2. Serialize the conflicting statements at the business level, e.g., allow only one of these DELETE transactions to execute at a time. This only fixes the DELETE transactions in this scenario; if other UPDATE or DELETE transactions have the same problem, each must be changed as well. It works, but it is cumbersome: it requires business-level code changes and reduces MySQL transaction throughput.
  • Solution 3. Improve MySQL configuration and performance so transactions finish faster and spend less time competing for locks. In practice the required resources may not be available, and it does not fix the root cause.
  • Solution 4. Set MySQL's transaction isolation level to SERIALIZABLE (the default is REPEATABLE READ). With serialized transactions there is no concurrency problem. It works, but MySQL transaction performance drops sharply.
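A minimal sketch of Solution 1 using plain JDBC; the table and column names follow the deadlock log (tb_entity, directory) plus an assumed integer primary key id, and the batch size is illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BatchedDelete {
    private static final int BATCH_SIZE = 500; // illustrative value

    public static void deleteByPrefix(Connection conn, String prefix) throws SQLException {
        while (true) {
            // 1) Fetch a small batch of matching ids; a plain SELECT is a
            //    non-locking consistent read at the default REPEATABLE READ level.
            List<Long> ids = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id FROM tb_entity WHERE directory LIKE CONCAT(?, '%') LIMIT " + BATCH_SIZE)) {
                ps.setString(1, prefix);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getLong(1));
                    }
                }
            }
            if (ids.isEmpty()) {
                return; // nothing left to delete
            }
            // 2) Delete by primary key only: each statement locks exactly these
            //    rows, so concurrent deletes no longer compete for scanned ranges.
            String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
            try (PreparedStatement ps = conn.prepareStatement(
                    "DELETE FROM tb_entity WHERE id IN (" + placeholders + ")")) {
                for (int i = 0; i < ids.size(); i++) {
                    ps.setLong(i + 1, ids.get(i));
                }
                ps.executeUpdate();
            }
        }
    }
}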

Summarize

Do not use LIKE in DELETE and UPDATE

First SELECT id WHERE … LIKE … LIMIT n, then DELETE / UPDATE … WHERE id IN (ids).

Check the deadlock log when operations time out

View the most recent deadlock log (the value of the Status field) with SHOW ENGINE INNODB STATUS; then focus on whether the SQL shown in the deadlock log points to a problem in the business code.

For concurrency issues, focus on data-volume estimates and performance

When the data volume is small or the server performance is strong, the probability of hitting concurrency problems is low.

Based on the current stress tests, using the following sysbench run command:


sysbench \
        --test='/usr/local/Cellar/sysbench/1.0.20_3/share/sysbench/tests/include/oltp_legacy/' \
        --oltp-tables-count=1 \
        --report-interval=10 \
        --oltp-table-size=1000000 \
        --mysql-user=root \
        --mysql-password=123456 \
        --mysql-table-engine=innodb \
        --rand-init=on  \
        --mysql-host=127.0.0.1 \
        --mysql-port=13336 \
        --mysql-db=benchmark \
        --time=300 \
        --max-requests=0 \
        --oltp_skip_trx=on \
        --oltp_auto_inc=off \
        --oltp_secondary=on \
        --threads=50 \
        run

The table holds 30,000 rows, each DELETE transaction hits 1,000 rows, and 20 transactions are issued concurrently:

On a MySQL instance with the following performance (the Docker instance above), the deadlock triggers with roughly 70% probability:

  • transactions: 18324 (tps 60.97 per sec.)
  • queries: 329832 (qps 1097.50 per sec.)

On a MySQL instance with the following performance (the local non-Docker instance), the deadlock does not reproduce:

  • transactions: 185025 (tps 616.36 per sec.)
  • queries: 3330450 (qps 11094.40 per sec.)

But in reality, MySQL may not enjoy such favorable conditions.
