one,Partition concept
PartitionAllows multiple parts of a single table to be allocated across the file system according to specified rules. Different parts of the table are stored as separate tables at different locations. MySQL supports Partition since 5.1.3.
Comparison of partition and manual table
Manual table classification | Partition |
Multiple data tables | A data sheet |
Risk of duplicate data | There is no risk of data duplication |
Write multiple tables | Write a table |
No unified constraints | Forced constraints |
MySQL supports RANGE, LIST, HASH, and KEY partition types, among which RANGE is the most commonly used:
- Range – This pattern allows data to be divided into different ranges. For example, a table can be divided into several partitions by year.
- Hash – This pattern allows calculation of the Hash Key of one or more columns of the table, and finally partitioning the data area corresponding to different values of this Hash code. For example, you can create a table that partitions the table primary key.
- Key (key value) - an extension of the Hash pattern above, here the Hash Key is generated by the MySQL system.
- List – This pattern allows the system to segment data by the value of the predefined list.
- Composite (Composite Mode) – The combination of the above modes
2. What can partitions do
- Logical data segmentation
- Improve single writing and reading application speed
- Improve the speed of partition-wide read query
- Splitting data can have multiple different physical file paths
- Efficiently save historical data
- Constraint checking on a table
- Different master-slave server partitioning policies, such as master partition by Hash partition, slave partition by range partition
Three, partition restrictions(Deadline5.1.44version)
• You can only partition the integer columns of the data table, or the data columns can be converted into integer columns through partitioning functions.
• The maximum number of partitions cannot exceed 1024
• If there is a unique index or primary key, the partition column must be included in all unique index or primary keys.
• Foreign keys are not supported
• Full text index is not supported
- Partitioning by date is very suitable because many date functions can be used. But there are not many suitable partition functions for strings
4. When to use partitions
• Massive data tables
• Quick query of historical tables can be done by ARCHIVE+PARTITION.
• The data table index is greater than the server's valid memory
• For large tables, especially when the index is much larger than the server's effective memory, you can not use the index, and the partition efficiency will be more effective at this time.
Five, partition experiment
Experiment 1:
Data published using US Bureau of Transportation Statistics (CSV format). Currently, it includes 113 million records, 7.5 GB data with 5.2 GB index. Time from 1987 to 2007.
The server uses 4GB of memory, so that the size of the data and indexes exceeds the memory size. The reason for setting to 4GB is that the data warehouse size is much larger than the possible memory size, which may reach several TB. For ordinary OLTP databases, index caches are in memory and can be retrieved quickly. If the data exceeds the memory size, you need to use a different approach.
Create a table with a primary key, because usually the table will have a primary key. The primary key of the table is too large to make the index unable to be read into memory, which is generally not efficient, meaning that you need to access the disk frequently, and the access speed depends entirely on your disk and processor. Currently, in large data warehouses designed, there is a common practice to not use indexes. Therefore, it will also be more capable of having and without primary keys.
Test method:
Use three types of data to induce MyISAM, InnoDB, Archive.
For each type of introductory, create an unpartitioned table with primary key (except archive) and two partitioned tables, one by month and one by year. The partition table partitioning method is as follows:
CREATE TABLE by_year (
d DATE
)
PARTITION BY RANGE (YEAR(d))
(
PARTITION P1 VALUES LESS THAN (2001),
PARTITION P2 VALUES LESS THAN (2002),
PARTITION P3 VALUES LESS THAN (2003),
PARTITION P4 VALUES LESS THAN (MAXVALUE)
)
CREATE TABLE by_month (
d DATE
)
PARTITION BY RANGE (TO_DAYS(d))
(
PARTITION P1 VALUES LESS THAN (to_days(‘2001-02-01′)), — January
PARTITION P2 VALUES LESS THAN (to_days(‘2001-03-01′)), — February
PARTITION P3 VALUES LESS THAN (to_days(‘2001-04-01′)), — March
PARTITION P4 VALUES LESS THAN (MAXVALUE)
)
Each is tested on a separate instance on mysql server, with only one library and one table per instance. Each type of introductory service will be started, the query will be run and the results will be recorded, and the service will be closed. Service instances are created with MySQL Sandbox.
The data loading is as follows:
ID | Introduce | Whether to partition | data | size | Remark | Loading time (*) |
1 | MyISAM | none | 113 million | 13 GB | with PK | 37 min |
2 | MyISAM | by month | 113 million | 8 GB | without PK | 19 min |
3 | MyISAM | by year | 113 million | 8 GB | without PK | 18 min |
4 | InnoDB | none | 113 million | 16 GB | with PK | 63 min |
5 | InnoDB | by month | 113 million | 10 GB | without PK | 59 min |
6 | InnoDB | by year | 113 million | 10 GB | without PK | 57 min |
7 | Archive | none | 113 million | 1.8 GB | no keys | 20 min |
8 | Archive | by month | 113 million | 1.8 GB | no keys | 21 min |
9 | Archive | by year | 113 million | 1.8 GB | no keys | 20 min |
*On dual-Xeon server
To compare the effect of partitions on large and small data sets, another 9 instances were created, each containing slightly less than 2GB of data.
There are two types of query statements
- Gathering query
SELECT COUNT(*)
FROM table_name
WHERE date_column BETWEEN start_date and end_date
- Specify record query
SELECT column_list
FROM table_name
WHERE column1 = x and column2 = y and column3 = z
For the first query, create statements with different date ranges. For each range, create an additional set of queries with the same range date. The first query in each date range is a cold query, which means it is the first hit, and the subsequent query in the same range is a warm query, which means it is at least partially cached. The query statement is on the Forge.
result:
1Partition table with primary key
The first test uses a composite primary key, just like the original data table uses. The primary key index file reaches 5.5 GB. It can be seen that the partition not only does not improve performance, but the primary key also slows down operations. Because if you use primary key index query, and the index cannot be read into memory, the performance will be very poor. It is prompted that partitioning is useful, but it must be used properly.
+——–+—————–+—————–+—————–+
| Status | myisam No partition | myisam monthly partition | myisam annual partition |
+——–+—————–+—————–+—————–+
| cold | 2.6574570285714 | 2.9169642 | 3.0373419714286 |
| warm | 2.5720722571429 | 3.1249698285714 | 3.1294000571429 |
+——–+—————–+—————–+—————–+
ARCHIVE
+——–+—————-+—————–+—————–+
| Status | archive not partitioned | archive monthly partition | archive annual partition |
+——–+—————-+—————–+—————–+
| cold | 249.849563 | 1.2436211111111 | 12.632532527778 |
| warm | 235.814442 | 1.0889786388889 | 12.600520777778 |
+——–+—————-+—————–+—————–+
Note that the response time of the ARCHIVE partition is better than using MyISAM.
2Partition table without primary key
Because if the size of the primary key exceeds the available key buffer, or even all memory, all queries using the primary key will use disk. The new way is to use only partitions, not primary keys. Performance has been significantly improved.
Monthly partition tables have achieved performance improvements of 70%-90%.
+——–+——————+——————+——————+
| Status | myisam No partition | myisam monthly partition | myisam annual partition |
+——–+——————+——————+——————+
| cold | 2.6864490285714 | 0.64206445714286 | 2.6343286285714 |
| warm | 2.8157905714286 | 0.18774977142857 | 2.2084743714286 |
+——–+——————+——————+——————+
To make the difference more obvious, I used two large-scale queries that could take advantage of the partition's partition elimination feature.
# query 1 – Yearly statistics
SELECT year(FlightDate) as y, count(*)
FROM flightstats
WHERE FlightDate BETWEEN “2001-01-01″ and “2003-12-31″
GROUP BY y
# query 2 – Monthly statistics
SELECT date_format(FlightDate,”%Y-%m”) as m, count(*)
FROM flightstats
WHERE FlightDate BETWEEN “2001-01-01″ and “2003-12-31″
GROUP BY m
The results show that the monthly partition table has 30%-60%, and the annual partition table has 15%-30% performance improvement.
+———-+———–+———–+———–+
| query_id | No points | Monthly points | Year points |
+———-+———–+———–+———–+
| 1 | 97.779958 | 36.296519 | 82.327554 |
| 2 | 69.61055 | 47.644986 | 47.60223 |
+———-+———–+———–+———–+
Processor factors
When the above test is tested on a home computer (Intel Dual Core 2.3 MHz CPU). For the original for dual Xeon 2.66 MHz, discovering new servers is faster! .
Repeat the above test, surprisingly:
+——–+——————-+————-+—————–+
|Status |myisam monthly partition| myisam annual partition|
+——–+——————-+————-+—————–+
| cold | 0.051063428571429 | 0.6577062 | 1.6663527428571 |
| warm | 0.063645485714286 | 0.1093724 | 1.2369152285714 |
+——–+——————-+————-+—————–+
myisam tables with primary keys without partitions are faster than partitioned tables. Partitioned tables perform the same as before, but the performance of unpartitioned tables improves, making partitions unnecessary. Now that this server seems to take full advantage of indexing, I have indexed on the partition column of the partition table.
# Original table
create table flightstats (
AirlineID int not null,
UniqueCarrier char(3) not null,
Carrier char(3) not null,
FlightDate date not null,
FlightNum char(5) not null,
TailNum char(8) not null,
ArrDelay double not null,
ArrTime datetime not null,
DepDelay double not null,
DepTime datetime not null,
Origin char(3) not null,
Dest char(3) not null,
Distance int not null,
Cancelled char(1) default ‘n',
primary key (FlightDate, AirlineID, Carrier, UniqueCarrier, FlightNum, Origin, DepTime, Dest)
)
# Partition table
create table flightstats (
AirlineID int not null,
UniqueCarrier char(3) not null,
Carrier char(3) not null,
FlightDate date not null,
FlightNum char(5) not null,
TailNum char(8) not null,
ArrDelay double not null,
ArrTime datetime not null,
DepDelay double not null,
DepTime datetime not null,
Origin char(3) not null,
Dest char(3) not null,
Distance int not null,
Cancelled char(1) default ‘n',
KEY (FlightDate)
)
PARTITION BY RANGE …
The result is satisfactory, with a 35% performance improvement.
+——–+——————-+——————-+——————-+
|Status |myisam monthly partition | myisam annual partition |
+——–+——————-+——————-+——————-+
| cold | 0.075289714285714 | 0.025491685714286 | 0.072398542857143 |
| warm | 0.064401257142857 | 0.031563085714286 | 0.056638085714286 |
+——–+——————-+——————-+——————-+
in conclusion:
1. Using table partitioning is not a guarantee of performance improvement. It depends on the following factors:
- the column used for partitioning;
- Partition function, if the original field is not int type;
- Server speed;
- The amount of memory.
2. Run benchmarks and performance tests before applying to production systems
Depend on the purpose of your database, you may get huge performance improvements or you may gain nothing. If you are not careful, it may even degrade performance.
For example: a table that uses monthly partitions can get extremely good speeds when always querying date ranges. But if there is no date query, a full table scan will be performed.
Partitioning is a key tool for improving performance of massive data. What is massive data depends on the hardware deployed. Blind use of partitions does not guarantee performance improvements, but it can be the perfect solution with the help of previous benchmarks and performance testing.
3. Archive Tables can be a good compromise
Archive table partitioning can achieve huge performance improvements. Of course, it also depends on your purpose. Any query is a full table scan when there is no partition. If you have historical data that does not need to be changed, you also need to analyze and count by time. Using Archive is an excellent choice. It will use 10-20% of the original space and has better performance for aggregation queries than MyISAM/InnoDB tables.
Although a well-optimized partitioned MyISAM table may perform better than the corresponding Archive table, it requires 10 times the space.
Experiment 2:
1. Create two tables, one partitioned by time field and the other does not partition.
CREATE TABLE part_tab
(
c1 int default NULL,
c2 varchar(30) default NULL,
c3 date default NULL
) engine=myisam
PARTITION BY RANGE (year(c3)) (PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996) , PARTITION p2 VALUES LESS THAN (1997) ,
PARTITION p3 VALUES LESS THAN (1998) , PARTITION p4 VALUES LESS THAN (1999) ,
PARTITION p5 VALUES LESS THAN (2000) , PARTITION p6 VALUES LESS THAN (2001) ,
PARTITION p7 VALUES LESS THAN (2002) , PARTITION p8 VALUES LESS THAN (2003) ,
PARTITION p9 VALUES LESS THAN (2004) , PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE );
create table no_part_tab
(c1 int(11) default NULL,
c2 varchar(30) default NULL,
c3 date default NULL) engine=myisam;
2. Create a stored procedure and use this procedure to insert 8 million different pieces of data into two tables.
delimiter //
CREATE PROCEDURE load_part_tab()
begin
declare v int default 0;
while v < 8000000
do
insert into part_tab
values (v,'testing partitions',adddate(‘1995-01-01′,(rand(v)*36520) mod 3652));
set v = v + 1;
end while;
end
//
Then execute
mysql> delimiter ;
mysql> call load_part_tab();
Query OK, 1 row affected (8 min 17.75 sec)
mysql> insert into no_part_tab select * from part_tab;
Query OK, 8000000 rows affected (51.59 sec)
Records: 8000000 Duplicates: 0 Warnings: 0
3. Start a simple range query of the data in these two tables. And display the execution process analysis:
mysql> select count(*) from no_part_tab where c3 > date ‘1995-01-01′ and c3 < date ‘1995-12-31′;
+———-+
| count(*) |
+———-+
| 795181 |
+———-+
1 row in set (38.30 sec)
mysql> select count(*) from part_tab where c3 > date ‘1995-01-01′ and c3 < date ‘1995-12-31′;
+———-+
| count(*) |
+———-+
| 795181 |
+———-+
1 row in set (3.88 sec)
mysql> explain select count(*) from no_part_tab where c3 > date ‘1995-01-01′ and c3 < date ‘1995-12-31′\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: no_part_tab
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 8000000
Extra: Using where
1 row in set (0.00 sec)
mysql> explain partitions select count(*) from part_tab where
-> c3 > date ‘1995-01-01′ and c3 < date ‘1995-12-31′\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: part_tab
partitions: p1
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 798458
Extra: Using where
1 row in set (0.00 sec)
From the above results, we can see that using table partitions reduces response time by 90% compared to non-partitions. The Explain program can see that only the first partition has been scanned during the query of the partitioned table, and the rest has been skipped. Further tests:
– Add date range
mysql> select count(*) from no_part_tab where c3 > date ‘-01-01′and c3 < date ‘1997-12-31′;
+———-+
| count(*) |
+———-+
| 2396524 |
+———-+
1 row in set (5.42 sec)
mysql> select count(*) from part_tab where c3 > date ‘-01-01′and c3 < date ‘1997-12-31′;
+———-+
| count(*) |
+———-+
| 2396524 |
+———-+
1 row in set (2.63 sec)
– Add unindexed field query
mysql> select count(*) from part_tab where c3 > date ‘-01-01′and c3 < date
‘1996-12-31′ and c2='hello';
+———-+
| count(*) |
+———-+
| 0 |
+———-+
1 row in set (0.75 sec)
mysql> select count(*) from no_part_tab where c3 > date ‘-01-01′and c3 < da
te ‘1996-12-31′ and c2='hello';
+———-+
| count(*) |
+———-+
| 0 |
+———-+
1 row in set (11.52 sec)
in conclusion:
- Partitioned and unpartitioned take up roughly the same file space (data and index files)
- If there are unindexed fields in the query statement, the partition time is much better than the unpartitioned time
- If the fields in the query statement are indexed, the difference between partition and unpartitioned narrows, and partitioning is slightly better than unpartitioned.
- For large data volumes, it is recommended to use partitioning function.
- Remove unnecessary fields
- According to the manual, adding myisam_max_sort_file_size will increase partition performance