SoFunction
Updated on 2025-04-08

Detailed explanation of the file-trigger method for PostgreSQL primary/standby switchover

For this article, please refer to the book PostgreSQL Practice.

Test environment for this document:

Primary IP: 192.168.40.130 Host name: postgres Port: 5442

Standby IP: 192.168.40.131 Host name: postgreshot Port: 5442

The primary/standby switchover of PostgreSQL 9.0 streaming replication can only be performed by creating a trigger file, and this section introduces that switchover method. The test environment is a one-primary, one-standby asynchronous streaming replication setup: the database on postgres is the primary and the database on postgreshot is the standby. The main steps of a manual file-trigger switchover are as follows:

1) Configure the trigger_file parameter in the standby's recovery.conf, setting the path and name of the trigger file that activates the standby.

2) Shut down the primary; using -m fast mode is recommended.

3) Create the trigger file on the standby to activate it. If recovery.conf is renamed to recovery.done, the standby has been switched to the primary role.

4) At this time, you need to switch the old primary to a standby: create a recovery.conf file in the $PGDATA directory of the old primary (if there is no recovery.conf in this directory, copy one from the $PGHOME/share/recovery.conf.sample template; if a recovery.done file is present, rename it to recovery.conf). The configuration is the same as on the old standby, just change the IP in the primary_conninfo parameter to the peer's IP.

5) Start the old primary and check whether the primary and standby processes are normal. If they are, the switchover has succeeded.

1. First, configure recovery.conf on the standby, as shown below:

[postgres@postgreshot pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.130 port=5442 user=replica application_name=pg1'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgreshot pg11]$ 

trigger_file can be configured as a normal file or a hidden file. After adjusting the above parameters, you need to restart the standby for the configuration to take effect, as sketched below.
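
A minimal sketch of that restart, assuming the standby's data directory is $PGDATA as in the rest of this article:

# run on the standby (postgreshot); -m fast aborts open sessions but keeps data consistent
pg_ctl restart -m fast -D $PGDATA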

2. Shut down the primary, as follows:

[postgres@postgres pg11]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@postgres pg11]$ 

3. Create the trigger file on the standby to activate it, as shown below:

[postgres@postgreshot pg11]$ ll recovery.conf
-rwx------ 1 postgres postgres 5.9K Mar 26 18:47 recovery.conf
[postgres@postgreshot pg11]$ 
[postgres@postgreshot pg11]$ touch /home/postgres/pg11/trigger
[postgres@postgreshot pg11]$ ll recovery*
-rwx------ 1 postgres postgres 5.9K Mar 26 18:47 recovery.done
[postgres@postgreshot pg11]$ 

The trigger file name and path must match the trigger_file setting in the configuration file. When you check the recovery file again, you find that the original recovery.conf has been renamed to recovery.done.

Check the standby's database log, as shown below:

2019-03-26 23:30:19.399 EDT [93162] LOG: replication terminated by primary server
2019-03-26 23:30:19.399 EDT [93162] DETAIL: End of WAL reached on timeline 3 at 0/50003D0.
2019-03-26 23:30:19.399 EDT [93162] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-03-26 23:30:19.399 EDT [93158] LOG: invalid record length at 0/50003D0: wanted 24, got 0
2019-03-26 23:30:19.405 EDT [125172] FATAL: could not connect to the primary server: server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
2019-03-26 23:30:24.410 EDT [125179] FATAL: could not connect to the primary server: could not connect to server: Connection refused
  Is the server running on host "192.168.40.130" and accepting
  TCP/IP connections on port 5442?
2019-03-26 23:31:49.505 EDT [93158] LOG: trigger file found: /home/postgres/pg11/trigger
2019-03-26 23:31:49.506 EDT [93158] LOG: redo done at 0/5000360
2019-03-26 23:31:49.506 EDT [93158] LOG: last completed transaction was at log time 2019-03-26 19:03:11.202845-04
2019-03-26 23:31:49.516 EDT [93158] LOG: selected new timeline ID: 4
2019-03-26 23:31:50.063 EDT [93158] LOG: archive recovery complete
2019-03-26 23:31:50.083 EDT [93157] LOG: database system is ready to accept connections

From the standby's log above: since the primary has been shut down, the log first shows that the standby cannot connect to the primary; the trigger file is then found, recovery completes successfully, and the database switches to read-write mode.

At this time, verify it with the pg_controldata output, as shown below:

[postgres@postgreshot ~]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgreshot ~]$ 

The above shows that the database is already in the primary role. Create a table named test_alived2 on postgreshot and insert data, as shown below:

postgres=# CREATE TABLE test_alived2(id int4);
CREATE TABLE
postgres=# INSERT INTO test_alived2 VALUES(1);
INSERT 0 1
postgres=# 
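
Besides pg_controldata, a quick SQL-level check (standard PostgreSQL, not part of the original steps) is pg_is_in_recovery(), which returns true on a standby and false on a primary:

# run on postgreshot; after promotion this should return f (false)
psql -p 5442 -c "select pg_is_in_recovery();"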

4. Prepare to switch the old primary to the standby role, and configure recovery.conf on the old primary as shown below:

[postgres@postgres pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.131 port=5442 user=replica application_name=pg2'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgres pg11]$ 

The above configuration is basically the same as the recovery.conf on postgreshot, except that the host option of the primary_conninfo parameter points to the peer host's IP.

Then create the ~/.pgpass file in the postgres user's home directory, as shown below:

[postgres@pghost1 ~]$ touch ~/.pgpass
[postgres@pghost1 ~]$ chmod 600 ~/.pgpass

And add the following content to the ~/.pgpass file:

[postgres@postgres ~]$ cat .pgpass
192.168.40.130:5442:replication:replica:replica
192.168.40.131:5442:replication:replica:replica
[postgres@postgres ~]
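
For reference (a libpq convention, not something specific to this setup), each ~/.pgpass line uses the field layout below; the literal word replication in the database field matches streaming-replication connections:

# ~/.pgpass field layout
hostname:port:database:username:password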

Then start the database on postgres as follows:

[postgres@postgres ~]$ pg_ctl start
waiting for server to start....2019-03-26 23:38:50.424 EDT [55380] LOG: listening on IPv4 address "0.0.0.0", port 5442
2019-03-26 23:38:50.424 EDT [55380] LOG: listening on IPv6 address "::", port 5442
2019-03-26 23:38:50.443 EDT [55380] LOG: listening on Unix socket "/tmp/.s.PGSQL.5442"
2019-03-26 23:38:50.465 EDT [55381] LOG: database system was shut down in recovery at 2019-03-26 23:38:20 EDT
2019-03-26 23:38:50.465 EDT [55381] LOG: entering standby mode
2019-03-26 23:38:50.483 EDT [55381] LOG: consistent recovery state reached at 0/50003D0
2019-03-26 23:38:50.483 EDT [55381] LOG: invalid record length at 0/50003D0: wanted 24, got 0
2019-03-26 23:38:50.483 EDT [55380] LOG: database system is ready to accept read only connections
 done
server started
[postgres@postgres ~]$ 2019-03-26 23:38:50.565 EDT [55385] LOG: fetching timeline history file for timeline 4 from primary server
2019-03-26 23:38:50.588 EDT [55385] LOG: started streaming WAL from primary at 0/5000000 on timeline 3
2019-03-26 23:38:50.589 EDT [55385] LOG: replication terminated by primary server
2019-03-26 23:38:50.589 EDT [55385] DETAIL: End of WAL reached on timeline 3 at 0/50003D0.
2019-03-26 23:38:50.592 EDT [55381] LOG: new target timeline is 4
2019-03-26 23:38:50.594 EDT [55385] LOG: restarted WAL streaming at 0/5000000 on timeline 4
2019-03-26 23:38:50.717 EDT [55381] LOG: redo starts at 0/50003D0
 
[postgres@postgres ~]$ pg_controldata | grep cluster
Database cluster state:  in archive recovery
[postgres@postgres ~]$ 
 
postgres=# select * from test_alived2;
 id 
----
 1
(1 row)
 
postgres=# 

At the same time, there is a WAL receiver process on postgres and a WAL sender process on postgreshot, which means the old primary has successfully switched to the standby role. These are all the steps of the primary/standby switchover.
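
Two hedged ways to confirm this, assuming the hosts and port used throughout this article: look for the walsender/walreceiver processes, or query the replication view on the new primary:

# on postgres (now the standby): a walreceiver process should be running
ps -ef | grep [w]alreceiver

# on postgreshot (now the primary): a walsender process and one row per connected standby
ps -ef | grep [w]alsender
psql -p 5442 -c "select application_name, state, sync_state from pg_stat_replication;"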

Why does step 2 require cleanly shutting down the primary? When the database is shut down, a checkpoint is performed first; once it completes, the WAL sender process is notified that it is about to be closed, and it sends the WAL stream up to the end of that checkpoint to the standby's WAL receiver process. After the standby receives this final WAL stream from the primary and applies it, it reaches a state consistent with the primary.
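
One illustrative way (not part of the original procedure) to confirm that the standby has caught up before it is activated is to compare the received and replayed LSNs on the standby; in this PostgreSQL 11 environment the two values should be equal once the shut-down primary's final WAL has been applied:

# run on the standby after the primary has been cleanly shut down
psql -p 5442 -c "select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"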

Another issue worth noting: if the primary host fails abnormally and the standby is then activated, will the standby's data be exactly the same as the primary's? This environment is a one-primary, one-standby asynchronous streaming replication setup; the standby replicates from the primary asynchronously, so there is a delay. WAL for transactions already committed on the primary may not have been sent to the standby before the primary host went down, so an asynchronous streaming replication standby carries a risk of losing transactions.
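
If that risk is unacceptable, synchronous replication can be used instead. A minimal sketch of the primary-side parameters in postgresql.conf, reusing the application_name pg1 from the recovery.conf shown earlier (a pg_ctl reload is enough for these settings):

# on the primary: commits wait until standby pg1 has flushed the WAL
synchronous_standby_names = 'pg1'
synchronous_commit = on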

pg_ctl promote method for primary/standby switchover

The above introduced the file-trigger switchover. Starting with PostgreSQL 9.1, the pg_ctl promote trigger method is supported, which is more convenient to operate than the file-trigger method. The syntax of the promote command is as follows:

pg_ctl promote [-D datadir]

-D specifies the data directory; if it is not specified, the value of the environment variable $PGDATA is used. After the promote command is issued, the running standby stops recovery mode and switches to a primary in read-write mode.
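
For example, in this test environment the command would simply be run on the standby host (a sketch only; the actual demonstration is deferred to the pg_rewind section below):

# run on the standby; -D may be omitted if $PGDATA is set
pg_ctl promote -D $PGDATA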

The switchover steps with pg_ctl promote are roughly the same as with the file-trigger method, except that the trigger_file parameter in recovery.conf (step 1 of the file-trigger method) is not needed, and creating the trigger file (step 3) is replaced by running pg_ctl promote, as follows:

1) Shut down the primary; using -m fast mode is recommended.

2) Execute the pg_ctl promote command on the standby to activate it. If recovery.conf is renamed to recovery.done, the standby has been switched to the primary role.

3) At this time, you need to switch the old primary to a standby: create a recovery.conf file in the $PGDATA directory of the old primary (if there is no recovery.conf in this directory, copy one from the $PGHOME/share/recovery.conf.sample template; if a recovery.done file is present, rename it to recovery.conf). The configuration is the same as on the old standby, just change the IP in the primary_conninfo parameter to the peer's IP.

4) Start the old primary and check whether the primary and standby processes are normal. If they are, the switchover has succeeded. These are the main steps of a pg_ctl promote switchover; they are not demonstrated in this section. The next section gives an example that uses the pg_rewind tool together with pg_ctl promote for the switchover.

pg_rewind

pg_rewind is a very useful data synchronization tool for streaming replication maintenance. The previous section listed five main steps for the primary/standby switchover; step 2 is to shut down the primary before activating the standby. What happens if step 2 is not done? Let's demonstrate with the following example. The test environment is a one-primary, one-standby asynchronous streaming replication setup: the database on postgres is the primary and the database on postgreshot is the standby.

Primary/standby switchover

--Standby node configuration: Operation on postgreshot

The standby configuration is as follows:

[postgres@postgreshot pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.130 port=5442 user=replica application_name=pg1'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgreshot pg11]$ 

--Activate the standby node: operation on postgreshot

Check the streaming replication status, then execute the following command on the standby host to activate the standby, as shown below:

[postgres@postgreshot pg11]$ pg_ctl promote -D $PGDATA
waiting for server to promote.... done
server promoted
[postgres@postgreshot pg11]$ 
[postgres@postgreshot pg11]$ 

Check the standby's database log; you can see that the database opened normally and is accepting external connections, which indicates that activation succeeded. Check the database role on postgreshot, as shown below:

[postgres@postgreshot pg11]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgreshot pg11]$

From the pg_controldata output you can also see that the database on postgreshot has become the primary, i.e., it has been switched to the primary role. At this point the old primary (the database on postgres) is still running. We plan to convert the database on postgres into a standby. First, check the database role on postgres, as shown below:

[postgres@postgres pg11]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgres pg11]$ 

--After the standby node is activated, create a test table and insert data

postgres=# create table test_1(id int4);
CREATE TABLE
postgres=# insert into test_1(id) select n from generate_series(1,10) n;
INSERT 0 10
postgres=# 

--Stop the original primary node: operation on postgres

[postgres@postgres pg11]$ pg_controldata | grep cluster
Database cluster state:  in production
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ pg_ctl stop -m fast -D $PGDATA
2019-03-27 01:10:46.714 EDT [64858] LOG: received fast shutdown request
waiting for server to shut down....2019-03-27 01:10:46.716 EDT [64858] LOG: aborting any active transactions
2019-03-27 01:10:46.717 EDT [64858] LOG: background worker "logical replication launcher" (PID 64865) exited with exit code 1
2019-03-27 01:10:46.718 EDT [64860] LOG: shutting down
2019-03-27 01:10:46.731 EDT [64858] LOG: database system is shut down
 done
server stopped
[postgres@postgres pg11]$ 

--pg_rewind: Operation on postgres

[postgres@postgreshot pg11]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.40.131 port=5442 user=replica password=replica'
 
target server needs to use either data checksums or "wal_log_hints = on"
Failure, exiting
[postgres@postgreshot pg11]$ 

Note: pg_rewind requires that data checksums were enabled when the database was initialized with initdb, or that wal_log_hints = on is set; so set the wal_log_hints parameter on both the primary and standby nodes and restart the databases, as sketched below.
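
A minimal sketch of that change, assuming postgresql.conf is edited directly on both nodes (wal_log_hints requires a full restart, not just a reload):

# in $PGDATA/postgresql.conf on both primary and standby
wal_log_hints = on

# restart each running instance so the setting takes effect
pg_ctl restart -m fast -D $PGDATA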

[postgres@postgres pg11]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.40.131 port=5442 user=replica password=replica'
servers diverged at WAL location 0/70001E8 on timeline 5
rewinding from last common checkpoint at 0/6000098 on timeline 5
Done!
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ 

Note: pg_rewind succeeded.

--Adjust recovery.conf: operation on postgres

[postgres@postgres pg11]$ mv recovery.done recovery.conf
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ cat recovery.conf | grep -v '^#'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=192.168.40.131 port=5442 user=replica application_name=pg2'  # e.g. 'host=localhost port=5432'
trigger_file = '/home/postgres/pg11/trigger'
[postgres@postgres pg11]$ 

--Start the original primary: operation on postgres

[postgres@postgres pg11]$ pg_ctl start -D $PGDATA
waiting for server to start....2019-03-27 01:14:48.028 EDT [66323] LOG: listening on IPv4 address "0.0.0.0", port 5442
2019-03-27 01:14:48.028 EDT [66323] LOG: listening on IPv6 address "::", port 5442
2019-03-27 01:14:48.031 EDT [66323] LOG: listening on Unix socket "/tmp/.s.PGSQL.5442"
2019-03-27 01:14:48.045 EDT [66324] LOG: database system was interrupted while in recovery at log time 2019-03-27 01:08:08 EDT
2019-03-27 01:14:48.045 EDT [66324] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2019-03-27 01:14:48.084 EDT [66324] LOG: entering standby mode
2019-03-27 01:14:48.089 EDT [66324] LOG: redo starts at 0/6000060
2019-03-27 01:14:48.091 EDT [66324] LOG: invalid record length at 0/7024C98: wanted 24, got 0
2019-03-27 01:14:48.096 EDT [66331] LOG: started streaming WAL from primary at 0/7000000 on timeline 6
2019-03-27 01:14:48.109 EDT [66324] LOG: consistent recovery state reached at 0/7024CD0
2019-03-27 01:14:48.110 EDT [66323] LOG: database system is ready to accept read only connections
 done
server started
[postgres@postgres pg11]$ 
[postgres@postgres pg11]$ pg_controldata | grep cluster
Database cluster state:  in archive recovery
[postgres@postgres pg11]$ 

--Data verification, operation on postgres

[postgres@postgres pg11]$ psql
psql (11.1)
Type "help" for help.
 
postgres=# select count(*) from test_1;
 count 
-------
 10
(1 row)
 
postgres=# 

Note: pg_rewind succeeded; the original primary has been started in the standby role, and the data in table test_1 has also been synchronized.

pg_rewind principle

The basic idea is to copy everything from the new cluster to the old cluster, except for the blocks that we know to be the same.

1) Scan the WAL log of the old cluster, starting from the last checkpoint before the point where the new cluster's timeline history forked off from the old cluster. For each WAL record, make a note of the data blocks that were touched. This yields a list of all the data blocks that were changed in the old cluster, after the new cluster forked off.

2) Copy all those changed blocks from the new cluster to the old cluster.

3) Copy all other files like clog, conf files etc. from the new cluster to the old cluster. Everything except the relation files.

4) Apply the WAL from the new cluster, starting from the checkpoint created at failover. (Strictly speaking, pg_rewind doesn't apply the WAL, it just creates a backup label file indicating that when PostgreSQL is started, it will start replay from that checkpoint and apply all the required WAL.)

Supplement: pitfalls encountered when building a Postgres primary/standby setup

A collection of pitfalls from building PostgreSQL primary/standby streaming replication:

1: A socket path error is reported when connecting.

Solution: set the Unix socket directory parameter (unix_socket_directories in postgresql.conf) to the path shown in the error message and restart.

2: When building a primary/standby pair, the standby's data directory must be built from a base backup of the primary's data. You can use pg_basebackup, or tar the primary's data directory and unpack it on the standby as a base backup.

If the standby has been accidentally initialized with initdb, delete everything (*) in its data directory and rebuild it from a base backup of the primary, as sketched below.
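
A pg_basebackup sketch using the replication account and primary address from this article's test environment (adjust to your own setup):

# run on the standby host with an empty $PGDATA
# -X stream streams WAL during the backup, -P shows progress,
# -R writes a minimal recovery.conf (PostgreSQL 11 behavior)
pg_basebackup -h 192.168.40.130 -p 5442 -U replica -D $PGDATA -X stream -P -R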

3: The standby fails to start with errors such as FATAL: no pg_hba.conf entry for replication connection from host "172.20.0.16", user "repl"

For example, primary IP: *.1, standby IP: *.2, replication account: repl.

Then it is not enough to simply add host replication repl *.2 md5 to pg_hba.conf.

You also need to add host all all *.2 md5 before this record

First of all, you must be able to access the primary before the repl account can be used for synchronization, as sketched below.
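
A sketch of the corresponding pg_hba.conf entries on the primary, using the standby address from the error message above as an example; reload the configuration afterwards:

# in $PGDATA/pg_hba.conf on the primary (ordinary access entry first, then the replication entry)
host    all            all     172.20.0.16/32    md5
host    replication    repl    172.20.0.16/32    md5

# make the new entries take effect
pg_ctl reload -D $PGDATA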

The above is my personal experience. I hope it can serve as a reference for you, and I hope you will continue to support me. If there are any mistakes or anything not fully considered, please feel free to point them out.