SoFunction
Updated on 2025-04-08

Implement automatic and expired cleaning of data in PostgreSQL

In PostgreSQL, automatic cleaning and expiration of data can be implemented in a variety of ways to ensure that the database does not suffer performance and waste of storage space due to excessive storage of outdated or no longer needed data. Here are some common methods and detailed examples:

1. Use TIMESTAMP columns and regular tasks

  • Added when creating a tableTIMESTAMPColumns are used to record the creation time or the last update time of the data

Suppose we have a name calledordersThe table is used to store order information, which contains onecreated_atColumns to record the time when the order was created:

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    order_amount DECIMAL(10, 2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
  • Create regular tasks (for example, usingcronor operating system timing tasks) to execute SQL statements that delete expired data

Suppose we want to delete order data that has been created for more than 30 days, we can write the following SQL statement:

DELETE FROM orders WHERE created_at < CURRENT_TIMESTAMP - INTERVAL '30 days';

Then, you can use the operating system's timing task tools (such ascronIn Linux systems) to execute the above SQL statements regularly. Assume that cleaning tasks are performed at 2 a.m. every day,cronThe expression may be as follows:

0 2 * * * psql -U your_username -d your_database -c "DELETE FROM orders WHERE created_at < CURRENT_TIMESTAMP - INTERVAL '30 days';"

The advantages of the above method are that it is simple and direct, easy to understand and implement. The disadvantage is that it needs to rely on the operating system's timing task mechanism and there may be a certain time delay, i.e., there may be a time difference between reaching the specified cleaning time point and actually performing the cleaning operation.

2. Use Event Triggers

PostgreSQL provides the function of event triggers that can be used in specific database events (such asINSERTUPDATEDELETEetc.) Execute custom functions when they occur.

  • First, create a function to handle the expired cleaning logic of the data
CREATE OR REPLACE FUNCTION expire_orders() RETURNS TRIGGER AS
$$
BEGIN
    DELETE FROM orders WHERE created_at < CURRENT_TIMESTAMP - INTERVAL '30 days';
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
  • Then, create an event trigger
CREATE TRIGGER trigger_expire_orders
AFTER INSERT OR UPDATE ON orders
EXECUTE FUNCTION expire_orders();

In this way, wheneverordersWhen the table is inserted or updated, it will be triggered.expire_ordersFunctions clean up expired data.

The advantage of this method is that it has better real-time performance and data expiration processing can be performed immediately when related operations occur. The disadvantage is that it may bring certain performance overhead to normal insertion or update operations, especially when the data volume is large.

3. Use Partitioned Tables

A partitioned table is to divide a large table into several smaller subtables according to some rules, so that data can be managed and manipulated more efficiently.

  • Create a partition table

Suppose the order table is partitioned by month:

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    order_amount DECIMAL(10, 2),
    created_at TIMESTAMP
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2023_01 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01 00:00:00') TO ('2023-01-31 23:59:59');

CREATE TABLE orders_2023_02 PARTITION OF orders
    FOR VALUES FROM ('2023-02-01 00:00:00') TO ('2023-02-28 23:59:59');

-- Create partition tables for other months
  • Regularly delete expired partitions

Can be passedDROP TABLEStatement to delete expired partitions, such as deleting the partitions of the previous month at the beginning of each month:

DROP TABLE orders_2023_01;

The advantage of partitioned tables is that they perform better when processing large amounts of data, and the operation of deleting expired partitions is relatively simple and efficient. The disadvantage is that the process of creating and managing partition tables is relatively complicated, and it is necessary to plan the partitioning strategy in advance.

4. Combining stored procedures and timing tasks

  • Create stored procedures
CREATE OR REPLACE PROCEDURE clean_expired_data()
LANGUAGE plpgsql
AS $$
BEGIN
    DELETE FROM orders WHERE created_at < CURRENT_TIMESTAMP - INTERVAL '30 days';
END;
$$;
  • Calling stored procedures using timed tasks

Similar to the SQL statements mentioned above using timed tasks to execute SQL statements, except that the stored procedure is called instead:

0 2 * * * psql -U your_username -d your_database -c "CALL clean_expired_data();"

This approach combines the encapsulation of stored procedures and the flexibility of timing tasks to facilitate maintenance and management of complex cleaning logic.

The following is a comprehensive example to show how to use a combination of the above methods in practical applications:

Sample Scenario

Suppose we have a user activity log tableuser_activity_log, used to record various operations of users in the system, including operation timeactivity_timeand operation detailsactivity_details. We want to clean up activity logs for more than 90 days regularly.

Implementation steps

  • Create a table
CREATE TABLE user_activity_log (
    log_id SERIAL PRIMARY KEY,
    user_id INT,
    activity_details TEXT,
    activity_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
  • Create stored procedures for cleaning data
CREATE OR REPLACE PROCEDURE clean_expired_activity_logs()
LANGUAGE plpgsql
AS $$
BEGIN
    DELETE FROM user_activity_log WHERE activity_time < CURRENT_TIMESTAMP - INTERVAL '90 days';
END;
$$;
  • Set up operating system timing tasks
    Assume that using LinuxcronService, perform cleaning tasks at 1 a.m. every day:
0 1 * * * psql -U your_username -d your_database -c "CALL clean_expired_activity_logs();"

Testing and Verification

After actually running for a period of time, you can verify that the cleaning operation is as expected by querying the data in the table:

SELECT * FROM user_activity_log;

Check if only the activity logs exist within the last 90 days.

This is the article about implementing automatic and expired data cleaning in PostgreSQL. For more related content on PostgreSQL to implement data cleaning, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!