1. Background
MySQL is a popular open-source database, widely used at major Internet companies. A single database instance usually meets early needs, but as the business grows, data volume and concurrency expand rapidly.
Once a single MySQL table holds more than a few million rows, query efficiency starts to suffer. In addition, the concurrency a single MySQL instance can sustain is limited.
At that point we need to split databases and split tables to improve the database's performance and scalability.
2. Why split databases and tables?
2.1 Splitting tables
Too much data in a single table severely affects SQL execution efficiency. Generally, once a single table reaches several million rows, performance degrades noticeably and the table should be split.
2.2 Splitting databases
A single MySQL instance generally supports at most around 2,000 concurrent connections; beyond that, the database should be split. Ideally, keep the concurrency of a healthy single instance at around 1,000.
# | Before splitting the database | After splitting the database |
---|---|---|
Concurrency | MySQL deployed on a single machine; cannot withstand high concurrency | Concurrency capacity grows from one machine to many |
Disk usage | The standalone MySQL disk is nearly full | Data is split across multiple databases, greatly reducing disk usage per database server |
SQL execution performance | A single table holds too much data, so SQL runs slower and slower | Each table holds less data, so SQL execution efficiency improves significantly |
In short: if the data volume is too large, split tables; if the concurrency is too high, split databases.
2.3 Problems introduced by splitting
- Join operations: tables that used to live in one database are now spread across different databases, so SQL joins are no longer possible. We have to assemble the data manually: query a row in one database, then use it to look up the related rows in another database.
- Transactions: tables that used to live in one database are now spread across different databases. If a single operation involves multiple databases, the database's built-in transactions can no longer meet our requirements.
- Distributed IDs: after splitting, the data is spread across databases on different servers, and auto-increment primary keys can no longer guarantee uniqueness. How do we generate globally unique primary keys across data nodes? We need to introduce distributed IDs into the system.
3. Pros and cons of different sharding middleware
3.1 Sharding-jdbc
1. A client-layer solution open-sourced by Dangdang
2. Supports database and table sharding, read/write splitting, distributed ID generation, and flexible transactions (best-effort-delivery transactions, TCC transactions)
3. The community is fairly active
- Advantages: no separate deployment, low operation and maintenance cost, no extra request forwarding through a proxy layer, so performance is high
- Disadvantages: every system must couple to the Sharding-jdbc dependency; an upgrade requires redeploying every system
Suitable for small and medium-sized companies
3.2 Mycat
1. A proxy-layer solution
2. A complete feature set
3. Very popular, with a very active community
- Advantages: transparent to each project; an upgrade only requires upgrading the middleware itself
- Disadvantages: must be deployed and operated as a separate piece of middleware, so operation and maintenance costs are high
Suitable for large companies
3.3 How to choose
Both sharding-jdbc and Mycat are recommended:
- Small companies should use sharding-jdbc: the client-layer solution is lightweight, has low maintenance cost, and needs no dedicated staff. Small and medium-sized companies also tend to have lower system complexity and fewer projects.
- Medium and large companies are better off with a proxy-layer solution such as Mycat: large companies may have many systems and projects, plus large, well-staffed teams, so it pays to have dedicated people study and maintain Mycat, which a large number of projects can then use directly and transparently.
4. How to split
4.1. Horizontal splitting (used the most)
Split the data of one table into multiple tables with identical structure, distribute the data evenly across them, and use the multiple tables to absorb the concurrency.
Split by a specific field:
For example, split the order table by order ID into 32 libraries with 32 tables each. Then orderId % 32 determines which library a row falls into, and orderId / 32 % 32 determines which table.
There are two ways to split:
- By range: each library holds a continuous slice of data, usually based on a time range. This is less commonly used because it easily creates hot spots: a large share of the traffic piles onto the newest data.
- By hash: distribute rows evenly according to the hash of some field. This is the more common approach.
Pros and cons:
- Range splitting's advantage is that expansion is very simple: just prepare one library per month in advance, and when a new month arrives, writes naturally go to the new library. The disadvantage is that most requests hit the latest data. Whether range splitting works in production depends on the scenario.
- Hash splitting's advantage is that data volume and request pressure are evenly distributed across libraries. The disadvantage is that expansion is troublesome: it involves a data migration in which existing data must be re-hashed and redistributed to different libraries or tables.
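The monthly range routing described above can be sketched like this (the `order_YYYYMM` library naming scheme is an assumption for illustration, not from any particular middleware):

```python
from datetime import datetime

# Range routing: each month's orders go to their own library, so expansion
# just means preparing next month's library in advance. The downside is that
# almost all writes hit the newest library (a hot spot).
def route_by_month(created_at: datetime) -> str:
    return f"order_{created_at:%Y%m}"
```

For example, an order created on 2024-01-15 routes to `order_202401`.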
4.2. Vertical splitting
Split a table with many columns into multiple tables or libraries, separating frequently accessed hot columns from rarely accessed cold columns. Because the database caches rows, the fewer hot columns a row carries, the more rows fit in the cache.
5. Once the sharding plan is decided, the following problems must be solved
- How do we migrate from a single database and single table to sharded databases and tables?
- What do we do if the number of libraries and tables itself hits a bottleneck?
- How do we handle IDs after sharding?
- How do we handle transactions after sharding?
5.1 How to design a system that can smoothly switch from a single database and table to sharded databases and tables?
Downtime migration
- One approach is to simply shut the system down and migrate, done late at night when there is no traffic.
- Migrate the data into the new databases, then restart the system. This requires downtime; it is not a smooth migration.
Double-write migration
This solution requires no downtime.
In short, every insert, update, and delete in the online system is applied to both the old library and the new library. This is called double writing: write the old library and the new library at the same time.
After this version of the system is deployed, the new library still lags far behind. Run a backfill tool that reads the old library's data and writes it into the new library. When writing, compare the row's last-modified time using a field such as gmt_modified: write only if the row does not yet exist in the new library, or if the incoming row is newer than the new library's copy. In short, never overwrite new data with old data.
After one backfill pass, the data may still be inconsistent, so the program automatically runs a verification pass that compares every row of every table in the old and new libraries. For any rows that differ, it reads the data from the old library and writes it again. Repeat this loop until every table in both libraries is exactly identical.
Once the data is fully consistent, switch over: deploy the latest code that reads and writes only the sharded libraries and tables. There is no downtime of even a few hours, and the process is very stable. This is basically how data migration is done in practice.
The drawback is that the code must support a transition phase that writes two libraries at once, which is slightly troublesome, but no downtime is needed and the migration is smooth.
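The "never overwrite new data with old data" rule in the backfill step can be sketched like this (plain dicts stand in for the old and new libraries; gmt_modified is the last-modified timestamp mentioned above):

```python
# Backfill one row from the old library into the new library.
# Write only if the row is missing from the new library, or the incoming
# copy is newer (by gmt_modified) than what the new library already holds.
def backfill_row(new_db: dict, row: dict) -> bool:
    existing = new_db.get(row["id"])
    if existing is None or row["gmt_modified"] > existing["gmt_modified"]:
        new_db[row["id"]] = dict(row)
        return True
    return False  # stale copy; double writing already stored something fresher
```

Because the check is "newer wins", running the backfill concurrently with live double writes can never clobber a fresher version of a row.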
5.2 What is the sharding plan for dynamic scaling?
After the initial sharding is done, if the data keeps growing, the original plan may no longer satisfy the business, and more libraries and tables are needed. Should we shard all over again?
That would be very troublesome, because migrating data that is already sharded is hard. Instead, allocate enough libraries and tables in one pass to avoid later data migration.
Shard once, generously: 32 libraries × 32 tables, 1,024 tables in total. Route by some ID: take the ID modulo 32 to choose the library, then divide by 32 and take modulo 32 again to choose the table within that library.
orderId | orderId % 32 (library) | orderId / 32 % 32 (table) |
---|---|---|
259 | 3 | 8 |
1189 | 5 | 5 |
352 | 0 | 11 |
4593 | 17 | 15 |
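The routing rule behind the table above can be sketched in Python (illustrative only; in practice the rule lives in the sharding middleware's configuration):

```python
# Route an order ID to one of 32 libraries, and to one of the 32 tables
# inside that library. With 32 x 32 = 1024 tables total, consecutive IDs
# spread across libraries first, then across tables within a library.
def route(order_id: int) -> tuple[int, int]:
    library = order_id % 32         # which of the 32 libraries
    table = order_id // 32 % 32     # which of the 32 tables in that library
    return library, table
```

For example, `route(259)` gives library 3, table 8, matching the first row of the table.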
At the beginning, these libraries can be purely logical: one MySQL server may host n libraries, say all 32. When you need to scale later, you simply move whole libraries between MySQL servers, and the system follows along with a configuration change.
For example, you can scale out to as many as 32 database servers, one library per server. Still not enough? Scale out to as many as 1,024 database servers, each holding one library with one table, since there are at most 1,024 tables.
This way you never have to write your own data-migration code; the work is left to the DBAs. The DBAs do have to move libraries between servers, but that is far more efficient than writing code to extract and transfer the data yourself.
To summarize the steps:
- Decide how many database servers, how many libraries per server, and how many tables per library. The recommended number is 32 libraries × 32 tables, which for most companies is enough for several years.
- Routing rule: orderId modulo 32 = library, orderId / 32 modulo 32 = table.
- To expand, request more database servers, install MySQL, and double the capacity each time: 4 servers to 8, then 8 to 16.
- The DBA migrates libraries from the original database servers to the new ones; there are convenient tools for library migration.
- Modify the configuration so each migrated library points at its new database server's address.
- Redeploy and go live. The original routing rules never change; the system simply continues serving online traffic with n times the database server resources.
6. How to handle the ID primary key after sharding?
6.1. Set a database sequence, or a step size for auto-increment fields
Suitable scenarios:
- To prevent duplicate IDs, each service node uses a different starting offset with the same step size. This is relatively simple to implement and also meets performance goals.
- However, both the number of service nodes and the step size are fixed; adding service nodes later is difficult.
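For example, with two service nodes and a step size of 2, node 1 generates 1, 3, 5, … and node 2 generates 2, 4, 6, …, so the sequences never collide. A minimal sketch:

```python
# Step-size ID generation: each node gets a distinct starting offset, and all
# nodes share the same step, so their sequences are disjoint. Adding a node
# later requires re-planning offsets and step, which is the scheme's weakness.
class SteppedId:
    def __init__(self, offset: int, step: int):
        self._next = offset
        self._step = step

    def next_id(self) -> int:
        value = self._next
        self._next += self._step
        return value
```

In a real deployment the same effect is usually achieved with MySQL's auto_increment_offset and auto_increment_increment settings rather than application code.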
6.2. UUID
UUIDs are unordered, which makes them index-unfriendly as IDs: on insert they cause excessive random writes in the B+ tree index (sequential IDs produce partially sequential writes), so they are not suitable as primary keys.
Suitable scenarios:
- Randomly generated file names, serial numbers, and the like
- Use UUIDs there, but do not use UUIDs as primary keys
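For example, generating random file names with Python's standard uuid module (the `.png` extension is just for illustration):

```python
import uuid

# Version 4 UUIDs are random and effectively collision-free, but unordered,
# which is exactly why they suit file names better than B+ tree primary keys.
def random_filename(ext: str) -> str:
    return f"{uuid.uuid4().hex}{ext}"
```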
6.3. Use the current system time (may collide under high concurrency)
Suitable scenarios:
- With this approach, you usually splice the current time together with several other business fields to form the ID. If the business can tolerate this, it is acceptable.
- Concatenate other business field values with the current time to form a practically unique number.
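A sketch of such a spliced number; the exact layout (second-level timestamp, last four digits of the user ID, a three-digit rolling sequence) is an assumption for illustration:

```python
import itertools
import time

_seq = itertools.count()  # per-process rolling sequence

# Spliced order number: timestamp + user-id suffix + sequence. Uniqueness is
# only "good enough for the business", not guaranteed across processes.
def order_no(user_id: int) -> str:
    ts = time.strftime("%Y%m%d%H%M%S")
    return f"{ts}{user_id % 10000:04d}{next(_seq) % 1000:03d}"
```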
6.4. Snowflake algorithm
- The snowflake algorithm is a distributed ID generation algorithm open-sourced by Twitter, implemented in Scala.
- It produces a 64-bit long ID: 1 bit is unused, 41 bits hold the timestamp in milliseconds, 10 bits hold the worker machine ID, and 12 bits hold the sequence number.
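A compact Python sketch of the same bit layout (1 unused sign bit, 41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the original is in Scala, and the epoch constant here is Twitter's:

```python
import threading
import time

class Snowflake:
    EPOCH = 1288834974657  # Twitter's epoch, in milliseconds

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # 10 bits for the worker machine id
        self.worker_id = worker_id
        self.sequence = 0             # 12-bit sequence within one millisecond
        self.last_ts = -1
        self.lock = threading.Lock()

    def _now_ms(self) -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        with self.lock:
            ts = self._now_ms()
            if ts == self.last_ts:
                # Same millisecond: bump the 12-bit sequence; on overflow,
                # spin until the next millisecond.
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    while ts <= self.last_ts:
                        ts = self._now_ms()
            else:
                self.sequence = 0
            self.last_ts = ts
            return ((ts - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one generator are strictly increasing, which keeps B+ tree inserts mostly sequential. (This sketch omits clock-rollback handling, which production implementations must deal with.)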
7. How to handle transactions after sharding?
After sharding, cross-database transactions may occur. How should they be handled?
7.1 Avoid cross-database transactions by design
- First, when sharding, try to avoid cross-database transaction scenarios in the first place.
- For example, when the order table is sharded by order ID, shard the other order-related tables by order ID as well. That way every transaction for one order stays within a single library, avoiding cross-database transactions.
7.2 Handling cross-database transactions
- When they cannot be avoided, sharding generally means a single system talking to multiple libraries, and the database's XA transactions can handle this scenario.
- sharding-jdbc supports the database's XA transactions; cross-database transactions can be implemented through the @DSTransactional annotation.
Mycat also supports XA transactions.
Summary
The above is based on personal experience; I hope it serves as a useful reference.