PostgreSQL Tutorial (III): Detailed explanation of table inheritance and partition table

1. Inheritance of tables:

This concept is somewhat unfamiliar to many developers who are already familiar with other database programming, but its implementation method and design principles are simple and easy to understand. Let's start with a simple example now.
1. The first inheritance table:

Copy the codeThe code is as follows:

CREATE TABLE cities (   --Parent table

        name        text,

        population float,

        altitude     int

    );
CREATE TABLE capitals ( --Subtable

        state      char(2)

    ) INHERITS (cities);

The capitals table inherits all properties of the cities table. In PostgreSQL, a table can inherit attributes from zero or more other tables, and a query can reference both all rows in the parent table, plus all rows in its child table, where the latter is the default behavior.

Copy the codeThe code is as follows:

MyTest=# INSERT INTO cities values('Las Vegas', 1.53, 2174); --Insert parent table

    INSERT 0 1
MyTest=# INSERT INTO cities values('Mariposa',3.30,1953);     --Insert the parent table

    INSERT 0 1
MyTest=# INSERT INTO capitals values('Madison',4.34,845,'WI');--Insert subtable

    INSERT 0 1
MyTest=# SELECT name, altitude FROM cities WHERE altitude > 500; --The data of both the parent and child tables are fetched.

       name     | altitude

    -----------+----------

     Las Vegas |     2174

     Mariposa   |     1953

     Madison    |      845

    (3 rows)

MyTest=# SELECT name, altitude FROM capitals WHERE altitude > 500; --Only the data of the subtable are fetched.

      name   | altitude

    ---------+----------

     Madison |      845

    (1 row)

If you want to extract data only from the parent table, you need to add the ONLY keyword to SQL, such as:

Copy the codeThe code is as follows:

    MyTest=# SELECT name,altitude FROM ONLY cities WHERE altitude > 500;

       name     | altitude

    -----------+----------

     Las Vegas |     2174

     Mariposa   |     1953

    (2 rows)

The "ONLY" keyword before cities in the above example means that the query should only search cities without including tables with inheritance levels below cities. Many of the commands we have discussed - SELECT, UPDATE and DELETE - support this "ONLY" symbol.
When executing the entire table data deletion, if the parent table is directly truncate, the data of the parent table and all its child tables will be deleted. If it is just truncate the child table, the data of the parent table will not change, but the data in the child table will be cleared.

Copy the codeThe code is as follows:

MyTest=# TRUNCATE TABLE cities;  --The data of both parent and child tables are deleted.

    TRUNCATE TABLE

    MyTest=# SELECT * FROM capitals;

     name | population | altitude | state

    ------+------------+----------+-------

    (0 rows)

2. Determine the source of the data:
Sometimes you may want to know which table a record comes from. In each table we have a system implicit field tableoid, which can tell you the source of the table:

Copy the codeThe code is as follows:

    MyTest=# SELECT tableoid, name, altitude FROM cities WHERE altitude > 500;

     tableoid |   name    | altitude

    ----------+-----------+----------

        16532 | Las Vegas |     2174

        16532 | Mariposa  |     1953

        16538 | Madison   |      845

    (3 rows)

The above result only gives tableoid. We still cannot see the actual table name through this value alone. To accomplish this, we need to associate it with the system table pg_class to extract the actual table name from the table via the tableoid field, see the following query:

Copy the codeThe code is as follows:

    MyTest=# SELECT , ,  FROM cities c,pg_class p WHERE  > 500 and  = ;

     relname  |   name    | altitude

    ----------+-----------+----------

     cities    | Las Vegas |     2174

     cities    | Mariposa   |     1953

     capitals | Madison    |      845

    (3 rows)

3. Notes on data insertion:
Inheritance does not automatically populate data from INSERT or COPY to other tables in the inheritance level. In our example, the following INSERT statement will not succeed:

Copy the codeThe code is as follows:

    INSERT INTO cities (name, population, altitude, state) VALUES ('New York', NULL, NULL, 'NY');

 

We may want the data to be passed into the capitals table, but this will not happen: INSERT always inserts the explicitly declared table.

4. Multi-table inheritance:
A table can be inherited from multiple parent tables, in which case it has the sum of the fields of the parent table. Any fields defined in the subtable will also be added. If the same field name appears in multiple parent tables, or appears in the definitions of the parent table and the child table at the same time, then these fields will be "fusion", so that there is only one such field in the child table. To fuse, the fields must be of the same data type, otherwise an error will be thrown. The fused field will have all the constraints of the field it inherits.

Copy the codeThe code is as follows:

    CREATE TABLE parent1 (FirstCol integer);

    CREATE TABLE parent2 (FirstCol integer, SecondCol varchar(20));

    CREATE TABLE parent3 (FirstCol varchar(200)); 
--Sub table child1 will inherit from both parent1 and parent2 tables, and both parent tables contain the FirstCol field of type integer, so child1 can be created successfully.

    CREATE TABLE child1 (MyCol timestamp) INHERITS (parent1,parent2);
--Sub table child2 will not be created successfully because both of its parent tables contain FirstCol fields, but their types are different.

    CREATE TABLE child2 (MyCol timestamp) INHERITS (parent1,parent3);
--Sub table child3 will also not be created successfully because it and its parent table contain FirstCol fields, but their types are different.

    CREATE TABLE child3 (FirstCol varchar(20)) INHERITS(parent1);

5. Inheritance and permissions:

Table access permissions are not automatically inherited. Therefore, a user trying to access the parent table must also have permission to access all its child tables, or use the ONLY keyword to extract data only from the parent table. When adding a new child table to an existing inheritance hierarchy, be careful to give it all permissions.
A serious limitation of inheritance features is that indexes (including unique constraints) and foreign key constraints are only applied to a single table, not including their inherited child tables. This is true for both the reference table and the reference table, so in the above example, if we declare it as UNIQUE or a PRIMARY KEY, it will not prevent the capitals table from having cities data rows with duplicate names. And these duplicate rows will be displayed by default when querying the cities table. In fact, capitals will have no unique constraints at all by default, so it may contain multiple lines with the same name. You should add unique constraints to capitals, but doing so will not avoid duplication with cities. Similarly, if we declare REFERENCES some other tables, this constraint will not be automatically broadcasted to capitals. Under this condition, you can do this by manually adding the same REFERENCES constraint to capitals.

2. Partition table:

1. Overview of the partition table:
Partitioning means dividing a large logical table into several physical blocks. Partitioning can provide several benefits:
1). Some types of query performance can be greatly improved.
2). Updated performance can also be improved because the index of each block of the table is smaller than the index on the entire dataset. If the index cannot be placed all in memory, then read and write on the index will generate more disk access.
3). Batch deletion can be achieved by simply deleting a certain partition.
4). Move rarely used data to cheaper, slower storage media.
Assume that the current database does not support partitioned tables, and the amount of data that our application needs to process is also very large. For this application scenario, we have to manually split the large table into multiple small tables according to certain rules, so that each small table contains data in different intervals. In this way, we must first calculate the small tables that need to be operated on by this instruction before data is inserted, updated, deleted and queryed. For some queries, since the query interval may span multiple small tables, we have to union the query results of multiple small tables to merge data from multiple tables and finally form a result set to return to the client. It can be seen that if the database we are using does not support partition tables, then in scenarios that are suitable for their application, we need to do a lot of additional programming work to make up for this lack. However, it should be noted that although the functions can be barely coped, the performance cannot be compared with the partition table.
Currently, the partition forms supported by PostgreSQL are mainly the following two types:
1). Range partition: The table is partitioned into "ranges" by one or more key fields, and values that do not overlap between these ranges are distributed to different partitions. For example, we can partition specific business objects based on data ranges, or partition based on identifier ranges.
2). List partition: The table is implemented by explicitly listing the key values that should appear in each partition.

2. Implement partitioning:
1). Create a "main table" from which all partitions are inherited.

Copy the codeThe code is as follows:

CREATE TABLE measurement (               --Main table

        city_id      int    NOT NULL,

        logdate     date  NOT NULL,

        peaktemp int,

    );

2). Create several "sub" tables, each inherited from the main table. Usually, these "sub" tables will not add any more fields. We will call subtables partitions, although they are ordinary PostgreSQL tables.

Copy the codeThe code is as follows:

    CREATE TABLE measurement_yy04mm02 ( ) INHERITS (measurement);

    CREATE TABLE measurement_yy04mm03 ( ) INHERITS (measurement);

    ...

    CREATE TABLE measurement_yy05mm11 ( ) INHERITS (measurement);

    CREATE TABLE measurement_yy05mm12 ( ) INHERITS (measurement);

    CREATE TABLE measurement_yy06mm01 ( ) INHERITS (measurement);

The subtables created above are divided into the range in the form of years and months, and the data of different years and months will belong to different subtables. Such an implementation method will be extremely convenient and efficient for clearing partitioned data, that is, directly execute the DROP TABLE statement to delete the corresponding subtable, and then consider whether to rebuild the subtable (partition) according to the actual application. Compared to direct DROP subtables, PostgreSQL also provides another more convenient way to manage subtables:

Copy the codeThe code is as follows:

ALTER TABLE measurement_yy06mm01 NO INHERIT measurement;

Compared with direct DROP, this method is simply to separate the sub-table from the original main table, and the data stored in the sub-table can still be accessed, because the table has been restored to an ordinary data table at this time. In this way, for the database DBA, you can perform necessary maintenance operations on the table at this time, such as data cleaning, archiving, etc. After completing many routine operations, you can consider whether to delete the table directly (DROP TABLE), or to clear the data of the table first (TRUNCATE TABLE), and then let the table inherit the main table again, such as:

Copy the codeThe code is as follows:

ALTER TABLE measurement_yy06mm01 INHERIT measurement;

3). Add constraints to the partition table and define the allowed skill values for each partition. At the same time, it should be noted that the defined constraints must ensure that there will be no same key values in different partitions. Therefore, we need to modify the definition of the above "sub" table to the following form:

Copy the codeThe code is as follows:

    CREATE TABLE measurement_yy04mm02 (

        CHECK ( logdate >= DATE '2004-02-01' AND logdate < DATE '2004-03-01')

    ) INHERITS (measurement);

    CREATE TABLE measurement_yy04mm03 (

        CHECK (logdate >= DATE '2004-03-01' AND logdate < DATE '2004-04-01')

    ) INHERITS (measurement);

    ...

    CREATE TABLE measurement_yy05mm11 (

        CHECK (logdate >= DATE '2005-11-01' AND logdate < DATE '2005-12-01')

    ) INHERITS (measurement);

    CREATE TABLE measurement_yy05mm12 (

        CHECK (logdate >= DATE '2005-12-01' AND logdate < DATE '2006-01-01')

    ) INHERITS (measurement);

    CREATE TABLE measurement_yy06mm01 (

        CHECK (logdate >= DATE '2006-01-01' AND logdate < DATE '2006-02-01')

    ) INHERITS (measurement);

4). Create indexes based on key values as much as possible. If needed, we can also create indexes for other fields in the subtable.

Copy the codeThe code is as follows:

    CREATE INDEX measurement_yy04mm02_logdate ON measurement_yy04mm02 (logdate);

    CREATE INDEX measurement_yy04mm03_logdate ON measurement_yy04mm03 (logdate);

    ...

    CREATE INDEX measurement_yy05mm11_logdate ON measurement_yy05mm11 (logdate);

    CREATE INDEX measurement_yy05mm12_logdate ON measurement_yy05mm12 (logdate);

    CREATE INDEX measurement_yy06mm01_logdate ON measurement_yy06mm01 (logdate);

5). Define a rule or trigger to redirect modifications to the main table to the appropriate partition table.
If the data only enters the latest partition, we can set up a very simple rule to insert the data. We must redefine this rule every month, which is to modify the subtable name of the redirected insert so that it always points to the current partition.

Copy the codeThe code is as follows:

    CREATE OR REPLACE RULE measurement_current_partition AS

    ON INSERT TO measurement

    DO INSTEAD

    INSERT INTO measurement_yy06mm01 VALUES (NEW.city_id, , );

where NEW is a keyword, representing a collection of new data fields. Here, each field in the set can be obtained by using the dot (.) operator.
We may want to insert data and want the server to automatically locate which partition to insert data to. We can achieve this with a more complex rule set like the one below.

Copy the codeThe code is as follows:

    CREATE RULE measurement_insert_yy04mm02 AS

    ON INSERT TO measurement WHERE (logdate >= DATE '2004-02-01' AND logdate < DATE '2004-03-01')

    DO INSTEAD

    INSERT INTO measurement_yy04mm02 VALUES (NEW.city_id, , );

    ...

    CREATE RULE measurement_insert_yy05mm12 AS

    ON INSERT TO measurement WHERE (logdate >= DATE '2005-12-01' AND logdate < DATE '2006-01-01')

    DO INSTEAD

    INSERT INTO measurement_yy05mm12 VALUES (NEW.city_id, , );

    CREATE RULE measurement_insert_yy06mm01 AS

    ON INSERT TO measurement WHERE (logdate >= DATE '2006-01-01' AND logdate < DATE '2006-02-01')

    DO INSTEAD

    INSERT INTO measurement_yy06mm01 VALUES (NEW.city_id, , );

Please note that the WHERE clause in each rule matches the CHECK constraint of its partition.
It can be seen that a complex partitioning scheme may require quite a lot of DDL. In the example above we need to create a new partition once a month, so it is wise to write a script to automatically generate the required DDL. In addition, it is not difficult to infer that partition tables have certain suppression on batch insertion operations of new data, which is the same in Oracle.
In addition to the above-mentioned redirecting the data of the main table to each subtable through Rule, we can also do this through triggers. Compared with the Rule-based redirection method, the trigger-based method may bring better insertion efficiency, especially for non-batch insertion cases. However, for batch insertion, since the extra overhead of Rule is table-based rather than row-based, the effect will be better than the trigger method. Another thing to note is that the copy operation will ignore Rules. If we want to insert data through the COPY method, you can only copy the data directly to the correct subtable, not the main table. This limitation will not cause any problem with the trigger. There is another problem with the Rule-based redirection method, that is, when the inserted data is not in the constraints of any subtable, PostgreSQL will not report an error, but will directly retain the data in the main table.

6). Add a new partition:

Here we will introduce two ways to add new partitions. The first method is simple and intuitive. We just create a new subtable and define new check constraints for it, such as:

Copy the codeThe code is as follows:

    CREATE TABLE measurement_y2008m02 (

        CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' )

    ) INHERITS (measurement);

The creation steps of the second method are relatively cumbersome, but more flexible and practical. See the following four steps:

Copy the codeThe code is as follows:

/* Create an independent data table (measurement_y2008m02) which is templated when created using the future main table (measurement) as a template, including the default value (DEFAULTS) and consistency constraints (CONSTRAINTS) of the template table.  */

    CREATE TABLE measurement_y2008m02

        (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
/* Check constraints that need to be used when creating future subtables for this table.  */

    ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02

        CHECK (logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01');
/* Import data to this table.  The following is just an example of how to import data.  After importing the data, further data processing can be performed if possible, such as data conversion, filtering, etc.  */

    \copy measurement_y2008m02 from 'measurement_y2008m02'
/* Let the table inherit the main table when appropriate, or when needed.  */

    ALTER TABLE measurement_y2008m02 INHERIT measurement;

7). Make sure that the configuration parameter constraint_exclusion in it is open. Without this parameter, the query will not be optimized as needed. What we need to do here is to make sure that the option is not commented out in the configuration file.

Copy the codeThe code is as follows:

    /> pwd

    /opt/PostgreSQL/9.1/data

    /> cat  | grep "constraint_exclusion"

    constraint_exclusion = partition        # on, off, or partition

3. Partition and constraint exclusion:
Constraint exclusion is a query optimization technique that improves the performance of table partitions defined using the above method. for example:

Copy the codeThe code is as follows:

    SET constraint_exclusion = on;

    SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';

If there is no constraint exclusion, the above query will scan every partition in the measurement table. After the constraint exclusion is turned on, the planner will check the constraints for each partition before viewing it to prove that the partition does not need to be scanned because it cannot contain any data rows that meet the WHERE clause conditions. If the planner can prove this, it excludes the partition from the query plan.
You can use the EXPLAIN command to display the difference in constraint_exclusion when the constraint_exclusion is turned on and off. A typical default plan for a table set using the above method is:

Copy the codeThe code is as follows:

    SET constraint_exclusion = off;
    EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';
                                              QUERY PLAN
    -----------------------------------------------------------------------------------------------
     Aggregate (cost=158.66..158.68 rows=1 width=0)
       -> Append (cost=0.00..151.88 rows=2715 width=0)
             -> Seq Scan on measurement (cost=0.00..30.38 rows=543 width=0)
                   Filter: (logdate >= '2006-01-01'::date)
             -> Seq Scan on measurement_yy04mm02 measurement (cost=0.00..30.38 rows=543 width=0)
                   Filter: (logdate >= '2006-01-01'::date)
             -> Seq Scan on measurement_yy04mm03 measurement (cost=0.00..30.38 rows=543 width=0)
                   Filter: (logdate >= '2006-01-01'::date)
    ...
             -> Seq Scan on measurement_yy05mm12 measurement (cost=0.00..30.38 rows=543 width=0)
                   Filter: (logdate >= '2006-01-01'::date)
             -> Seq Scan on measurement_yy06mm01 measurement (cost=0.00..30.38 rows=543 width=0)
                   Filter: (logdate >= '2006-01-01'::date)

As can be seen from the query plan above, PostgreSQL scans all partitions. Let’s take a look at the query plan after turning on constraint exclusion:

Copy the codeThe code is as follows:

    SET constraint_exclusion = on;

    EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';    

                                              QUERY PLAN

    -----------------------------------------------------------------------------------------------

     Aggregate  (cost=63.47..63.48 rows=1 width=0)

       ->  Append  (cost=0.00..60.75 rows=1086 width=0)

             ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)

                   Filter: (logdate >= '2006-01-01'::date)

             ->  Seq Scan on measurement_yy06mm01 measurement  (cost=0.00..30.38 rows=543 width=0)

                   Filter: (logdate >= '2006-01-01'::date)

Note that constraint exclusion is driven only by CHECK constraints, not by indexes.
The default value of this configuration in the current version of PostgreSQL is partition, which is a behavior between on and off, that is, the planner will only apply constraint exclusion to queries based on partition tables, while the on setting will perform constraint exclusion for all queries. Therefore, for ordinary data tables, you will have to bear the additional overhead incurred by this mechanism.

There are several precautions for constraint exclusion when using:
1). Constraint exclusion only takes effect when the WHERE clause of the query contains constraints. A parameterized query will not be optimized because the planner does not know which partition the parameter will select at runtime. Therefore functions like CURRENT_DATE must be avoided. Joining partition key values with fields in another table will not be optimized.
2). Comparisons across data types should be avoided in the CHECK constraint, because the planner will not be able to prove that such conditions are false. For example, the following constraints will be available when x is an integer field, but cannot be used when x is a bigint:
    CHECK (x = 1)
For bigint fields, we must use constraints like the following:
    CHECK (x = 1::bigint)
This problem is not limited to the bigint data type, it may occur in situations where the default data type of any constraint does not match the data type of the field it compares. Comparisons across data types in submitted queries are usually OK, but they cannot be in the CHECK condition.
3). The UPDATE and DELETE commands on the main table do not perform constraint exclusion.
4). When the planner performs constraint exclusion, all constraints of all partitions on the main table will be checked, so a large number of partitions will significantly increase the time for query planning.
5). When executing the ANALYZE statement, execute the command for each partition, rather than just executing the command on the main table.