Logging SQL Server database errors

We receive many database alerts with many of the alerts logging some of these same alerts or information to files or tables. What we’ve found over time is that the logging is now costing us quite a bit of resource. Our logging server (where both files and table logging are stored) has had a few outages related to conflicts from messages for other servers. We’ve considered scaling the alerting by environment type, but we’ve also considered that we may be logging too much information about our databases. In addition, since we receive so many alerts each day, it’s impossible for us to resolve them and assist with other issues that arise. What are some techniques that we can use to help us with the issue of too much logging information and too many alerts?

Overview

We often coordinate alerting with logging errors, where some errors become alerts. While we have situations where we want a detailed log of what happened, especially with some errors or application processes, we can over log information or retain information well after we need it and face an issue where our logging uses many resources. In a few cases, we may have a next step we can immediately run or try before logging if an event occurs. We’ll look at some scenarios about when intensive logging is useful and scenarios where we can reduce or eliminate it.

Logged messages require multiple steps to resolve

Suppose that we run a delete script, which removed 1 million rows of data in a database. This delete signals one of our applications to remove further related data in other databases, as our initial database is a starting point for our applications. Eventually, over 25 million rows of data are removed along with other records in other databases being updated for various applications, but we find an error: only 30% of the 1 million rows that were removed needed to be removed. This means that we have a significant amount of deletes to revert along with re-updating related information.

When we consider how we would log a delete where multiple databases are impacted, we would use intensive logging that would cost many resources; for an example:

Log the deletes of the records in first database
Log the deletes in all further databases, that use the first database as a resource
Log the updates to all related database records

We could use change data tracking or queue tables to keep this information (the below code shows an example of using a queue table for this), but we have to keep at least one logging table of the delete operation so that further transactions can execute this same delete. The same would be true for an insert or update if we have further tables that must also be updated or added to with new or changed data.

---- Save our transaction prior to the transaction

INSERT INTO tbQueue

SELECT

ID --- id of values being removed

, 'D' --- operation

, GETDATE() --- time

, 0 --- processed yet in further tables

FROM tbReport

WHERE ID BETWEEN 13 AND 41

---- Execute our transaction

DELETE

FROM tbReport

WHERE ID BETWEEN 13 AND 41

---- Using our queue table to execute removals or updates in tables dependent on our tbReport

DELETE

FROM OtherDatabase.dbo.DependentTableOne

WHERE ID IN (SELECT ID FROM tbQueue WHERE Processed = 0)

DELETE

FROM OtherDatabase.dbo.DependentTableTwo

WHERE ID IN (SELECT ID FROM tbQueue WHERE Processed = 0)

Some form of logging will be required when a change in our first table (Entry Table) affects changes in dependent tables (Dependent Table One and Two), one of which affects another dependent table (Dependent Table On Dependent Table One) and processes that are similar to the above flow.

This carries significant costs to log (not just in space, but we must design this in a manner in which we lose none of these records, or else we can’t revert a change), but without this, how would we accurately revert the original removal? Even if we did a multi-table removal, update or insert where we passed in one removal, update or insert at a time for all tables, upon reversion, we’d still need logging that allowed us to revert the change. A restore database operation is not always an option if a change creates further changes in other databases or tables.

In these situations, we must keep extensive logs of information and I suggest that our logging tables use data compression techniques, such as row or page level compression or column store compression, which can reduce disk usage. In the below images, we see the effects of column store compression for a table that retains logging information from the error log:

1	CREATE CLUSTERED COLUMNSTORE INDEX CCI_tbSavedErrorLog ON tbSavedErrorLog

The table size before we add column store compression.

The table size after we add column store compression.

Logged messages have a couple of solutions

In some cases, we may get extensive log messages on an error or successful messages that may not need to be kept, especially over longer periods of time. For example, most of us frequently run CHECKDB against all our databases on a schedule. In some cases, because our environment may be limited in resources, we may only retain the final CHECKDB message if it passes – whether by saving the output or retaining it from the error log, or we may only retain the latest successful CHECKDB result (assuming it was successful). When we find an error, we want the full error, even with the objects that passed, as we may have multiple steps that we follow or we may want additional information even if we fix the error – what could have caused corruption in the first place is a question we may be asking and want to know more information.

In some environments, we may only want to retain the final message of a CHECKDB command if successful and keep the full message if failed.

When CHECKDB fails – we need to know why it failed and how to resolve it. In this case, logging becomes very useful. The fact that CHECKDB passed seven years ago for our database when we run the command every 8 hours is less important to know and something we don’t need to retain. Since failed events like this require several possible solutions along with related information, we want to keep extensive logging for these types of events. As for successful CHECKDB messages and related logged messages like this, DBAs can decide how much and how long they want to keep the successful messages they receive based on their experience of when they may need them or under what circumstances.

In these situations, we may want to keep failures, related information upon failures, a reduced message on success, or archive or remove old logging in a shorter period. These situations require less logging than the first case, except in some situations, and they may also not need to be retained for a long period of time.

Logged messages have one solution or outcome

In some cases, if an error occurs, we have one solution or outcome after the error, or we may not need logging at all, but an immediate resolution. Suppose that we have a configuration table that we update on deployments – adding new records, updating existing records, and removing records. We could log the success or failure of the events after the script. We could also validate the script as we run it so that we avoid this additional logging, or significantly reduce it.

In the below example, we add one record and update two records to a configuration table and we validate that 3 total rows were added or changed (the @count variable) and we test that the report location we added is valid in our report dump table, where we would keep track of these locations;

DECLARE @count INT = 0, @rptcount INT

BEGIN TRAN

----

INSERT INTO tbConfigurables

VALUES ('D:\Files\fields.rpt',GETDATE(),GETDATE())

SELECT @count += @@ROWCOUNT

UPDATE tbConfigurables

SET LocFile = '<data>17</data>'

, EditDate = GETDATE()

WHERE LocFile = '<data>12</data>'

SELECT @count += @@ROWCOUNT

SELECT @rptcount = COUNT(*) FROM tbReportLocations WHERE FileDump = 'D:\Files\fields.rpt'

---- Testing the changes:

IF ((@count = 3) AND (@rptcount >= 1))

BEGIN

COMMIT TRAN

END

ELSE

BEGIN

ROLLBACK TRAN

END

In some cases, we may not be able to perform this with values we change, as we may require application validation. An alternative route to this is to include a rollback script with every change. For example, to rollback the above script, we could add in a commented section below the script:

---- Roll back script

DELETE

FROM tbConfigurables

WHERE LocFile = 'D:\Files\fields.rpt'

UPDATE tbConfigurables

SET LocFile = '<data>12</data>'

, EditDate = GETDATE()

WHERE LocFile = '<data>17</data>'

The result of this is that we don’t end up with extra information that doesn’t help us, or requires that we make more decisions when we only have one decision to make – the script and (or) validation of the script failed, which means the script needs to be rolled back. This approach holds true for some events that require one or two solutions when they occur, such as an event that always requires a restart or an event that always requires a cache flush, etc. We may want to only log what was executed and whether it passed, or alert with the immediate next step and avoid information that uses space for no reason.

In these situations, we want to solve as much as we can within the transaction or if we only have one action to take after a message. Logging the summary may be useful, but logging each step if on error we do one thing consumes unnecessary resources.

SQLShack

Logging SQL Server database errors

Overview

Logged messages require multiple steps to resolve

Logged messages have a couple of solutions

Logged messages have one solution or outcome

See also

Overview

Logged messages require multiple steps to resolve

Logged messages have a couple of solutions

Logged messages have one solution or outcome

See also

Related posts: