SoFunction
Updated on 2025-03-03

Detailed analysis of PostgreSQL transaction rewinding case

background

A while ago, a customer reported that his RDS PostgreSQL could not be written, and the error message was as follows:

postgres=# select * from test;
 id 
----
(0 rows)

postgres=# insert into test select 1;
ERROR:  database is not accepting commands to avoid wraparound data loss in database "xxxx"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

After the RDS engineer intervened and dealt with it, the problem was immediately resolved.

Basic principles of XID

XID definition

XID (Transaction ID) is the transaction number inside PostgreSQL. Each transaction will be assigned an XID, incremented in sequence. Each tuple header in PostgreSQL data will store the XID of the tuple (Transaction ID) of the tuple (Transaction ID), and then the kernel constructs the database consistent reading through this XID. When the transaction isolation level is repeatable read, assuming that if there are two transactions, xid1=200 and xid2=201, then only tuples of t_xmin <= 200 are seen in xid1, and tuples of t_xmin > 200 are not seen.

typedef uint32 TransactionId;  /* Transaction number definition, 32-bit unsigned integer */
typedef struct HeapTupleFields
{
  TransactionId t_xmin;    /* Transaction number that inserts the tuple */
  TransactionId t_xmax;    /* Delete or lock the transaction number of the tuple */

    /*** Other attributes omitted ***/
} HeapTupleFields;

struct HeapTupleHeaderData
{
  union
  {
    HeapTupleFields t_heap;
    DatumTupleFields t_datum;
  }      t_choice;

    /*** Other attributes omitted ***/
};

XID Issuing Mechanism

From the above structure we can see that XID is a 32-bit unsigned integer, that is, the range of XID is 0 to 2^32-1; then what should I do if the transaction exceeds 2^32-1? In fact, XID is a ring, and after exceeding 2^32-1, it will be allocated from scratch. The above conclusion is also proved through the source code:

// Invalid transaction number#define InvalidTransactionId    ((TransactionId) 0)
// Boot transaction number, used in the database initialization process (BKI execution)#define BootstrapTransactionId    ((TransactionId) 1)
// Freeze transaction numbers are used to represent very old tuples, which are earlier than all normal transaction numbers (that is, visible)#define FrozenTransactionId      ((TransactionId) 2)
// The first normal transaction number#define FirstNormalTransactionId  ((TransactionId) 3)
// Generate xid with the lower 32 bits of FullTransactionId as an unsigned integer#define XidFromFullTransactionId(x)    ((uint32) (x).value)

static inline void
FullTransactionIdAdvance(FullTransactionId *dest)
{
  dest-&gt;value++;
  while (XidFromFullTransactionId(*dest) &lt; FirstNormalTransactionId)
    dest-&gt;value++;
}

FullTransactionId
GetNewTransactionId(bool isSubXact)
{
    /*** Omit ***/
  full_xid = ShmemVariableCache-&gt;nextFullXid;
  xid = XidFromFullTransactionId(full_xid);
    /*** Omit ***/
  FullTransactionIdAdvance(&amp;ShmemVariableCache-&gt;nextFullXid);
    /*** Omitted ***
  return full_xid;
}

static void
AssignTransactionId(TransactionState s)
{
    /*** Omit ***/
  s-&gt;fullTransactionId = GetNewTransactionId(isSubXact);
  if (!isSubXact)
    XactTopFullTransactionId = s-&gt;fullTransactionId;
    /*** Omit ***/
}

TransactionId
GetTopTransactionId(void)
{
  if (!FullTransactionIdIsValid(XactTopFullTransactionId))
    AssignTransactionId(&amp;TopTransactionStateData);
  return XidFromFullTransactionId(XactTopFullTransactionId);
}

As you can see, the new transaction number is saved in the shared variable cache: ShmemVariableCache->nextFullXid. After each transaction number is issued, adjust its value upwards and skip the above three special values. The three special components are 0, 1 and 2 respectively. The functions can be found in the code comments above.

XID rollback mechanism

As mentioned earlier, XID is a ring, which is allocated to 2^32-1 and then started from 3. So how does the kernel compare the sizes of two transactions? For example, xid has gone through such a process 3-> 2^32-1 -> 5, so how does the kernel know that 5 This transaction is behind 2^32-1? Let's look at the code again:

/*
 * TransactionIdPrecedes --- is id1 logically < id2?
 */
bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
  /*
   * If either ID is a permanent XID then we can just do unsigned
   * comparison.  If both are normal, do a modulo-2^32 comparison.
   */
  int32    diff;

  if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
    return (id1 < id2);

  diff = (int32) (id1 - id2);
  return (diff < 0);
}

As you can see, the kernel uses a tricky method: (int32) (id1 - id2) < 0, the value range of 32-bit signed integers is -2^31 to 231-1, and the value obtained by 5-(232-1) is larger than 2^31-1, so converting to int32 will turn into a negative number. However, there is a problem here. The "Latest Transaction Number-Oldest Transaction Number" must be less than 2^31. Once it is greater, a rewind will occur, resulting in the data generated by the old transaction being invisible to the new transaction.

XID rollback prevention

As mentioned earlier, "Latest Transaction Number - Oldest Transaction Number" must be less than 2^31, otherwise a rewind will occur, causing the data generated by the old transaction to be invisible to the new transaction. How does the kernel avoid this problem? The kernel handles it like this: by regularly updating the XID of the tuple generated by the old transaction to FrozenTransactionId, that is, to 2, to recycle the XID, and the tuple with XID of 2 is visible to all transactions. This process is called XID freeze. In this way, XID can be recycled to ensure |Latest transaction number - Oldest transaction number | < 2^31.
In addition to the kernel's automatic freezing and recycling XID, we can also manually perform xid freezing and recycling through commands or sql.

  • Query the age of the database or table. The database age refers to: "Latest transaction number - the oldest transaction number in the database", and the table age refers to: "Latest transaction number - the oldest transaction number in the table"
# Check the age of each librarySELECT datname, age(datfrozenxid) FROM pg_database;

#1 library age sorting for each tableSELECT ::regclass as table_name, greatest(age(),age()) as age FROM pg_class c LEFT JOIN pg_class t ON  =  WHERE  IN ('r', 'm') order by age desc; 

# Check the age of 1 tableselect oid::regclass,age(relfrozenxid) from pg_class where oid='schema name.table name'::regclass::oid;

Manually freeze sql to recycle tuples of a table:

vacuum freeze Table name;

Manually freeze the command to recycle all tables in a library xid:

vacuumdb -d Library name --freeze --jobs=30 -h Connection string -p Port number -U LibraryOwner

The freeze recycling process is an IO-repeated operation. The kernel will describe all pages of the table and then update the t_xmin field of the tuple that meets the requirements to 2. Therefore, this process needs to be carried out at the business peak to avoid affecting the business.

There are three kernel parameters related to freezing and recycling: vacuum_freeze_min_age, vacuum_freeze_table_age and autovacuum_freeze_max_age. Since the author does not have a deep understanding of these three parameters, he will not show off his skills here. Interested students can find information to learn about it by themselves.

Solution

Problem analysis

Based on the above principle analysis, we know that "Latest transaction number - Oldest transaction number" =  2^31-1000000, that is, when there is only one million available xid left, the kernel will prohibit instance writing and report an error: database is not accepting commands to avoid wraparound data loss in database. At this time, you must connect to "xxxx" in the prompt to freeze the table to recycle more XIDs.

void
SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
{
  TransactionId xidVacLimit;
  TransactionId xidWarnLimit;
  TransactionId xidStopLimit;
  TransactionId xidWrapLimit;
  TransactionId curXid;

  Assert(TransactionIdIsNormal(oldest_datfrozenxid));

  /*
      * xidWrapLimit = the oldest transaction number + 0x7FFFFFFFF. Once the current transaction number reaches xidWrapLimit, a volume back will occur.
    */
  xidWrapLimit = oldest_datfrozenxid + (MaxTransactionId &gt;&gt; 1);
  if (xidWrapLimit &lt; FirstNormalTransactionId)
    xidWrapLimit += FirstNormalTransactionId;

  /*
      * Once the current transaction number reaches xidStopLimit, the instance will not be written, and 1000000 xid is reserved for vacuum
      * Each vacuum table needs to occupy one xid
    */
  xidStopLimit = xidWrapLimit - 1000000;
  if (xidStopLimit &lt; FirstNormalTransactionId)
    xidStopLimit -= FirstNormalTransactionId;

  /*
      * Once the current transaction number reaches xidWarnLimit, it will be received continuously
      * WARNING: database "xxxx" must be vacuumed within 2740112 transactions
    */
  xidWarnLimit = xidStopLimit - 10000000;
  if (xidWarnLimit &lt; FirstNormalTransactionId)
    xidWarnLimit -= FirstNormalTransactionId;

  /*
      * Force autovacuums will be triggered once the current transaction number reaches xidVacLimit
    */
  xidVacLimit = oldest_datfrozenxid + autovacuum_freeze_max_age;
  if (xidVacLimit &lt; FirstNormalTransactionId)
    xidVacLimit += FirstNormalTransactionId;

  /* Grab lock for just long enough to set the new limit values */
  LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
  ShmemVariableCache-&gt;oldestXid = oldest_datfrozenxid;
  ShmemVariableCache-&gt;xidVacLimit = xidVacLimit;
  ShmemVariableCache-&gt;xidWarnLimit = xidWarnLimit;
  ShmemVariableCache-&gt;xidStopLimit = xidStopLimit;
  ShmemVariableCache-&gt;xidWrapLimit = xidWrapLimit;
  ShmemVariableCache-&gt;oldestXidDB = oldest_datoid;
  curXid = XidFromFullTransactionId(ShmemVariableCache-&gt;nextFullXid);
  LWLockRelease(XidGenLock);

  /* Log the info */
  ereport(DEBUG1,
      (errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
          xidWrapLimit, oldest_datoid)));

  /*
      * If the current transaction number >= the oldest transaction number + autovacuum_freeze_max_age
      * Trigger autovacuum to clean up the oldest database. If multiple databases meet the requirements, clean up in the oldest order.
    * Tag the current autovacuum by setting the flag bit and do it again after the autovacuum is finished.
      */
  if (TransactionIdFollowsOrEquals(curXid, xidVacLimit) &amp;&amp;
    IsUnderPostmaster &amp;&amp; !InRecovery)
    SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);

  /* Give an immediate warning if past the wrap warn point */
  if (TransactionIdFollowsOrEquals(curXid, xidWarnLimit) &amp;&amp; !InRecovery)
  {
    char     *oldest_datname;

    if (IsTransactionState())
      oldest_datname = get_database_name(oldest_datoid);
    else
      oldest_datname = NULL;

    if (oldest_datname)
      ereport(WARNING,
          (errmsg("database \"%s\" must be vacuumed within %u transactions",
              oldest_datname,
              xidWrapLimit - curXid),
           errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
               "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
    else
      ereport(WARNING,
          (errmsg("database with OID %u must be vacuumed within %u transactions",
              oldest_datoid,
              xidWrapLimit - curXid),
           errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
               "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
  }
}

bool
TransactionIdFollowsOrEquals(TransactionId id1, TransactionId id2)
{
  int32    diff;
  if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
    return (id1 &gt;= id2);

  diff = (int32) (id1 - id2);
  return (diff &gt;= 0);
}

FullTransactionId
GetNewTransactionId(bool isSubXact)
{
    /*** Omit ***/
  full_xid = ShmemVariableCache-&gt;nextFullXid;
  xid = XidFromFullTransactionId(full_xid);

  if (TransactionIdFollowsOrEquals(xid, ShmemVariableCache-&gt;xidVacLimit))
  {
    TransactionId xidWarnLimit = ShmemVariableCache-&gt;xidWarnLimit;
    TransactionId xidStopLimit = ShmemVariableCache-&gt;xidStopLimit;
    TransactionId xidWrapLimit = ShmemVariableCache-&gt;xidWrapLimit;
    Oid      oldest_datoid = ShmemVariableCache-&gt;oldestXidDB;

        /*** Omit ***/
    if (IsUnderPostmaster &amp;&amp;
      TransactionIdFollowsOrEquals(xid, xidStopLimit))
    {
      char     *oldest_datname = get_database_name(oldest_datoid);

      /* complain even if that DB has disappeared */
      if (oldest_datname)
        ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
                oldest_datname),
             errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
                 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
            /*** Omit ***/
    }
        /*** Omit ***/
  }
    /*** Omit ***/
}

Problem positioning

# Check the age of each librarySELECT datname, age(datfrozenxid) FROM pg_database;

#1 library age sorting for each tableSELECT ::regclass as table_name, greatest(age(),age()) as age FROM pg_class c LEFT JOIN pg_class t ON  =  WHERE  IN ('r', 'm') order by age desc; 

# Check the age of 1 tableselect oid::regclass,age(relfrozenxid) from pg_class where oid='schema name.table name'::regclass::oid;

Problem solving

  1. Through the first sql above, find the oldest database. The database age refers to: |Latest transaction number - the oldest transaction number in the database |
  2. Through the second sql above, find the oldest table, and then execute the table in turn: vacuum freeze table name, freeze the old transaction number in the table to recycle the table. The table age refers to: | Latest transaction number - the oldest transaction number in the table|
  3. Operation and maintenance scripts

Single process shell script

# Vacuum freeze on the top 50 oldest tables in the specified database
for cmd in `psql -Uusername -pPort number -hConnection string -dDatabase name -c "SELECT 'vacuum freeze '||::regclass||';' as vacuum_cmd FROM pg_class c LEFT JOIN pg_class t ON  =  WHERE  IN ('r', 'm') order by greatest(age(),age()) desc offset 50 limit 50;" | grep -v vacuum_cmd | grep -v row | grep vacuum`; do
     psql -U username -p port number -h connection string -d database name -c "$cmd"
done

Multi-process Python scripts

from multiprocessing import Pool
import psycopg2

args = dict(host='', port=5432, dbname='Database Name',
            user='username', password='password')

def vacuum_handler(sql):
    sql_str = "SELECT ::regclass as table_name, greatest(age(),age()) as age FROM pg_class c LEFT JOIN pg_class t ON  =  WHERE  IN ('r', 'm') order by age desc limit 10; "
    try:
        conn = (**args)
        cur = ()
        (sql)
        ()
        cur = ()
        (sql_str)
        print ()
        ()
    except Exception as e:
        print str(e)

# Conduct vacuum freeze on the top 1000 oldest tables in the specified database, and 32 processes execute concurrentlydef multi_vacuum():
    pool = Pool(processes=32)
    sql_str = "SELECT 'vacuum freeze '||::regclass||';' as vacuum_cmd FROM pg_class c LEFT JOIN pg_class t ON  =  WHERE  IN ('r', 'm') order by greatest(age(),age()) desc limit 1000;";
    try:
        conn = (**args)
        cur = ()
        (sql_str)
        rows = ()
        for row in rows:
            cmd = row['vacuum_cmd']
            pool.apply_async(vacuum_handler, (cmd, ))
        ()
        ()
        ()
    except Exception as e:
        print str(e)


multi_vacuum()

Friendly Tips

vacuum freeze will scan all pages of the table and update it. It is an IO-repeated operation. During the operation, you must control the concurrency number, otherwise it will be very easy to hang the instance.

Author information

Xie Guiqi (window name: Yuan Qing) joined Alibaba Cloud after graduation in 2020 and has been engaged in RDS PostgreSQL-related work, and is good at solving various online RDS PostgreSQL operation, maintenance and control problems.

Summarize

This is the end of this article about PostgreSQL transaction rewinding. For more related PostgreSQL transaction rewinding content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!