Why did full backup file size grow after data movement for Availability Group replica was suspended?

Question

I found an error log entry reporting that data movement to a secondary availability replica had been suspended. I'm not sure whether the cause of the suspension is relevant to my question, but it was "Error 3456 ... could not redo log record" etc. This is a known issue whose resolution involves patching SQL Server.
However, this condition wasn't addressed for a few days. The nightly full backup files (taken against the primary replica) began growing by about 10% per night. The database also has log backups taken every 15 minutes for most hours of the day, but log backups are disabled during a window at night while indexes are rebuilt and full backups are taken.
To get the secondary back online, it had to be restored from backup. Some issues were encountered along the way. First an attempt was made to rejoin the AG, but that hung for ages. Then a command was issued to remove that replica database from the AG, but of course it hung too. I killed both of those processes but could not kill the session that had faulted while redoing the log records (as it is not a user session, although its ID was above 50, by the way). I therefore restarted SQL Server on that replica before eventually restoring from backup and rejoining the AG.
Having resolved the issue, the next nightly full backup file size (on the primary, again) had reduced back to normal size.
Why would the full backup file increase in size until this issue was resolved? Is it because the log could not be cleared and this was being included in the full backup? If so, then what was getting backed up in the 15-minute log backups? Imagine for a moment that the primary had suffered a failure during this period: could I not have restored from the full and log backup files I had available (i.e. because something evidently remained in the log, given that it wasn't clearing)? And if the log really wasn't clearing, then why not? The secondary had been suspended, so surely the primary no longer waits for the secondary before committing its own log?
Afterthought: the system process that I could not kill without a server restart was the one that faulted while attempting to apply the logs to the secondary - leading the secondary to be suspended. I suppose this is a "long running transaction"? If this is the reason, can someone still please clarify for me the earlier questions - particularly about how "at risk" the primary was until this was resolved?
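
As a way of making the "could I have restored?" part of the question concrete, here is a minimal Python sketch that checks whether a series of log backups forms an unbroken chain. It assumes each backup is described by a `first_lsn` and `last_lsn` (values of the kind SQL Server records in `msdb.dbo.backupset`), and that an unbroken chain means each log backup starts exactly where the previous one ended. The `LogBackup` type, the `find_chain_gaps` helper and the LSN numbers used with it are hypothetical, for illustration only.

```python
from typing import NamedTuple


class LogBackup(NamedTuple):
    """One log backup, described by the LSN range it covers (hypothetical shape)."""
    first_lsn: int
    last_lsn: int


def find_chain_gaps(backups: list[LogBackup]) -> list[tuple[int, int]]:
    """Return (expected_first_lsn, actual_first_lsn) for every break in the chain.

    Assumption: the chain is unbroken when each backup's first_lsn equals
    the last_lsn of the backup before it.
    """
    ordered = sorted(backups, key=lambda b: b.first_lsn)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.first_lsn != prev.last_lsn:
            gaps.append((prev.last_lsn, cur.first_lsn))
    return gaps
```

An empty result would suggest the log backups taken during the suspension could be restored in sequence after a full backup. It says nothing about whether the individual backup files are readable.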

youcantryreachingme · Answer

I'm going to hazard a guess here -
Given that the fault that led to the replica being suspended is an acknowledged bug in SQL Server, my guess is that, as part of the bug, the primary was somehow continually retrying to send synchronisation updates to the secondary. Those could not be hardened on the secondary's log, so the primary retained the transaction information, and the continual retries slowly increased the amount of log being used for this transaction.
Further, none of the other transactions on the primary were being synced to the secondary, so they were persisted to the primary's log, and the log backups on the primary were backing up this information.
Lastly, the fact the primary "held on" to the secondary in this way (with the thread that didn't die) is simply also a part of the bug, or is an additional bug that was exposed when the secondary failed to apply the logs. EDIT: It looks like this is a second known issue. In other words, SQL Server should have "let go" of the secondary once the secondary was suspended, but did not, and that is also a bug.
As to the question of what might have been lost had the primary failed during this extended period, my best guess is that the loss would have been limited to the information that was tying up the log, and that this information related to keeping the secondary in sync. If the primary wasn't persisting the changes until it received confirmation that the secondary had hardened that log, then whatever changes were being carried out by the transaction being synced would have been lost. All other transactions, however, would have been persisted to the primary (only) and subsequently backed up during the log backups (and then the nightly full backups).
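
To picture the guess above, here is a toy Python model of a primary's log. It is only a sketch and rests on one assumption that is not confirmed anywhere in this thread: log space can be cleared only up to the point that has been both captured by a log backup and applied by the secondary. The `ToyLog` class, its fields and the record counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyLog:
    """Toy transaction log; records are identified by increasing LSNs."""
    next_lsn: int = 1              # LSN the next record will receive
    truncated_to: int = 0          # highest LSN already cleared from the log
    backed_up_to: int = 0          # highest LSN captured by a log backup
    secondary_redone_to: int = 0   # highest LSN the secondary has applied
    secondary_suspended: bool = False

    def write(self, n: int) -> None:
        """Append n log records on the primary."""
        self.next_lsn += n
        if not self.secondary_suspended:
            # Assumption: a healthy secondary keeps up instantly.
            self.secondary_redone_to = self.next_lsn - 1

    def log_backup(self) -> int:
        """Capture all records not yet backed up, then try to clear the log."""
        last = self.next_lsn - 1
        captured = last - self.backed_up_to
        self.backed_up_to = last
        # Assumption: clearing is limited by BOTH the backup and the secondary.
        clearable = min(self.backed_up_to, self.secondary_redone_to)
        self.truncated_to = max(self.truncated_to, clearable)
        return captured

    def resync_secondary(self) -> None:
        """Model the secondary being restored and rejoined, fully caught up."""
        self.secondary_suspended = False
        self.secondary_redone_to = self.next_lsn - 1

    def retained(self) -> int:
        """Number of records still held in the log."""
        return (self.next_lsn - 1) - self.truncated_to


def simulate() -> dict:
    log = ToyLog()
    log.write(100)
    log.log_backup()                 # healthy: log clears fully
    log.secondary_suspended = True
    captured, retained = [], []
    for _ in range(3):               # three backup intervals while suspended
        log.write(100)
        captured.append(log.log_backup())
        retained.append(log.retained())
    log.resync_secondary()
    log.log_backup()
    return {
        "captured_per_backup": captured,
        "retained_while_suspended": retained,
        "retained_after_resync": log.retained(),
    }
```

In this model, each log backup taken during the suspension still captures the 100 new records, while the retained log grows from 100 to 200 to 300 records and drops back to 0 once the secondary is resynchronised. The sketch does not establish whether that retained log is what inflated the full backups.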
Certainly open to feedback on this answer, as it is a "best guess" for the scenario.
