Ed Pollack

Troubleshooting Database Mail Failures

April 25, 2018 by

Description

Database Mail is a convenient and easy way to send alerts, reports, or data from SQL Server. Failures are not obvious to the us though, and developing a process to monitor these failures alongside other failures will save immense headaches if anything ever goes wrong.

Database Mail: a (very) brief overview

Database Mail is a component of SQL Server that is available in every edition, except for Express. This feature is designed to be as simple as possible to enable, configure, and use.

Database Mail relies on SMTP to send emails via a specified email server to any number of recipients. When configuring, you provide a mail server, credentials (if needed), and then the service is ready to use. We’ll be focusing here on failure reporting and not configuration. If you need help setting up or configuring this feature, check out some of the references at the end of this article.

Once configured, database mail is used as the default method of email delivery by SQL Server unless specified otherwise. This includes emails that originate from SQL Server Agent via job failures or alerts, as well as any emails that we send using sp_send_dbmail. Sp_send_dbmail is a system stored procedure that resides in MSDB, and provides the ability to send emails using a wide variety of features. For example, we can send an email with a simple configuration like this:

When executed, we get a rather unceremonious (though desired) response:

There are another 20 parameters that can be used as well, ranging from file attachments, queries, importance, BCC, etc…You can also format an email as HTML and include tables and tags to add style or to organize a complex message. We will demo more involved usage of sp_send_dbmail below as we attack failure reporting.

Given all these features, it is possible for things to break along the way, either within a query, reading a profile, connecting to a mail server, or sending the email. We’ll dive into different ways in which Database Mail can fail, how to alert on these failures, and how to respond effectively to emails that have failed to send.

Viewing failure data

To manage Database Mail failures correctly, we need to set boundaries for our research. An email is generated and sent from SQL Server, but once it has been handed off to the mail server successfully, it is no longer in the scope of SQL Server. An email that is delivered to a mail server successfully is seen as a success by SQL Server, even if a failure occurs later on that mail server.

For our purposes, we will limit our research and alerting to features and components within SQL Server. If issues arise in which an email has been sent successfully by SQL Server, but not received by its recipients, then it will be necessary to check the mail server logs to ensure that something else did not go wrong. Invalid recipients, getting caught in spam filters, and server connectivity issues are not uncommon and can result in emails that appear to have sent correctly, but never reached their destination.

An important note to begin with: Database Mail failures reside in a separate log from the SQL Server and SQL Server Agent logs. This is a big deal as it means that most standard error log reports/searches will NOT include these errors by default! As a result, without explicitly logging and alerting on Database Mail failures, you may not know that emails are failing to send until someone downstream complains. As always, it is in our best interest to catch failures of any kind quickly and (if possible) resolve them before they negatively impact others that rely on those processes.

The image above shows how error logs are separated by type, and that the Database Mail log is distinct from other SQL Server and Windows logs. While viewing errors in the GUI is useful when we have a known problem to chase down, it is not an effective form of monitoring as a human staring at a screen waiting for errors is not terribly efficient (or fun!)

All of the work that we accomplish here will be to report on this log, including as much detail as is possible about both the failure, as well as the email that failed to send.

To start, let’s introduce the system views that contain Database Mail data and take a look at some of the date contained within them:

Msdb.dbo.sysmail_profile: For each Database Mail profile that exists, a row will be in this view that provides the profile name, description, last modified time, and last modified user. When using sp_send_dbmail, you must specify a profile that was previously defined in the Database Mail configuration. If you have yet to define any, then check out a guide on configuring Database Mail in the links at the end of this article. The key piece of information in this view is the profile name, as this is what you reference when using Database Mail.

Msdb.dbo.sysmail_event_log: Contains a row per informational or error message returned by Database Mail. You may configure the sensitivity of this collection via the Logging Level setting in the Database Mail configuration. In general, I prefer extended logging as we can very easily filter out informational messages later. The extra details can be useful when troubleshooting a server problem. For example, knowing when Database Mail starts and shuts down can be helpful when diagnosing email, network, or service problems.

If anything unusual happens, it will be logged here. With details of the error message, as well as the process are logged here. More importantly, the mail item ID is also included, allowing us to tie a Database Mail error directly to a specific email that failed to send.

Msdb.dbo.sysmail_faileditems: If a message fails to send, this contains all of the parameters that were passed to sp_send_dbmail, as well as the time of the failure. This is extremely useful for troubleshooting a failure and determining possible causes for the message not sending.

Another great use for this data is that with it, you can reconstruct a failed message and resend it! While automating a resend process could be a bit risky, the ability to aggregate a set of failed messages and resend them en masse is an immense time-saver and avoids the need to manually hack through the alerts, reports, and queries that were included in those messages.

Msdb.dbo.sysmail_help_queue_sp: This system stored procedure returns the status of the Database Mail queue:

The key takeaway from these results is the length of the queue. If this number is zero, then the queue is empty and Database Mail is likely idle. If the queue is greater than zero, then that indicates that there are more messages to process and the service has yet to catch up. If this number is growing larger over an extended period of time, then this could be indicative of a problem with the service or an excessively large flood of messages to Database Mail.

With these building blocks, we can put together a solution that will check for Database Mail Failures, log them, and alert us when detected.

Automating failure alerting

Before diving into code, let’s put together an outline of how we will build our solution. In order to collect data on, log, and alert on failures, we’ll want to follow a process similar to this:

  1. Create a table to store error and failed message details.
  2. Create a stored procedure that will check for and log Database Mail failures
    1. Log all failed items to the table above.
    2. If any failed items were logged above, the email an operator about them.
    3. Flag that failed item as sent, so we do not resend repeatedly.
  3. Create a job that periodically calls the stored procedure above.
  4. Create a view that allows easy reconstruction of the original Database Mail command, using the various components collected via the alerting process.

Create a table to store database mail failure data

We want to include two distinct sets of data within this table:

  1. The details of the failure, including time and error message.
  2. The details of the email itself, including contents, attachments, queries, etc…

By having both of these components, we can not only fix the cause of the failure, but we can also validate the email that was to be sent and resend it, if needed.

This table contains every column from sysmail_faileditems, as well as some error details from sysmail_event_log. The indexes on error times allows us to more efficiently search for failures in the future, if this table gets large.

This amount of detail may seem excessive, but only with all message details are we capable of fully understanding what email was to be sent, to whom, and all of its details. Oftentimes, many of these columns will be NULL as the messages may be simple emails with a subject, body, and a few recipients.

Create a stored procedure to log failures

With a place to save failure information, we can now create a stored procedure that will populate this table with details whenever a Database Mail failure occurs. Functionally, our tasks are simple and will not take that much code to complete.

Our proc definition will include a single parameter:

This parameter allows us to determine how far back in time to look for failures. We should choose a time frame large enough to account for timeouts, delays, and maintenance, but not so long that it might pull noise from old/stale items, or noise that may result from any MSDB cleanup of old items you may perform. We’ll choose 1 day (1440 minutes), but feel free to adjust higher or lower as needed.

First, we’ll determine the UTC offset from local time. Since times stored in the system views are in local server time, we need to convert to UTC in order to store our times in UTC. We’re choosing UTC over local time to make the code and data more portable:

Whenever accessing data, remember that times are in UTC and need to be converted if you’re looking to view them in your local time zone. The benefit will be that data can be read from anywhere without question as to its time zone or locale. That being said, we’ll include local time as well, for convenience.

This query will pull data from the database mail views in MSDB that we discussed earlier and deposit it into the database_mail_failure table. Note that the bit has_email_been_sent_to_operator is set to 0. We will change this to 1 later on, after an alert has been sent.

Now that we have logged the failure, we can compose an email that will alert us of it. You may be asking, “What if email is down, how will we get that alert?” We’ll address that shortly as it is a valid question, and one that should be asked for any alerting system.

This alert checks to see if any unsent failures exist, and if so will compose an email with the most pertinent information from the table and send it over to me. Not that the mail profile and recipients are hard-coded here. Feel free to add these as stored proc parameters if you have need to change them often.

Lastly, we will set the has_email_been_sent_to_operator to 1 to signify that these alerts have been passed on to an operator:

Create a SQL Server Agent job

With an alerting stored procedure available for use, we’ll create a job that runs every 15 minutes and checks for failed Database Mail messages:



Create a view to reconstruct messages

Within our database_mail_failure table are all of the parameters from a typical use of Database Mail. As a result, we can use that information to reconstitute a command that could then be used to resend the original message. This provides us with a great deal of convenience in the event of an emergency. We can use this to resend failed emails, or at least review messages before resending.

In the event that a large number of failures occur at once, this allows us to avoid a laborious manual process in order to accomplish a this task. The following view returns every column from database_mail_failure, as well as one additional column called database_mail_query, which contains an sp_send_dbmail command using the details from the failed message:

While that is quite a bit of TSQL, it’s mostly formatting, NULL handling, and cleanup of the original details in order to produce a valid sp_send_dbmail command. We can test the view with a query that checks based on error date:

The result shows our new column added to the beginning of the column list:

We can pull out the first column for a query in the list and view the entirety of its text here:

There it is! The original email message that failed. We can tell now that it failed due to the @recipients parameter receiving an (intentionally) invalid email address. From here, we can take the necessary troubleshooting steps and choose to resend the email, if necessary.

Test the process

One last step for us is to fail an email and validate that the process works correctly. How does the alert look and does it contain all of the information we are looking for?

The above email was the result of my sending two emails using sp_send_dbmail and providing invalid email addresses for the @recipients parameter. The resulting alert is formatted into an HTML table using data from database_mail_failure. A convenience here is that we can reconstruct the email or any related data from the failed message at any time from that table. This provides some level of insurance against additional alerting problems that make receiving the message above problematic.

But, what happens if email is down?!

The question that you have been thinking about since the start of this article is about to be answered! If email is unavailable, and we are alerting on failed messages via email, then how will this work!? There are several ways to attack this problem, and ideally, we would address all of them:

Monitor & alert the mail server effectively

First and foremost, we should have monitoring and alerting configured for our mail server. If email is an important channel of communication for production database server events then it must be monitored. If the mail server becomes unavailable or ceases to send/receive messages for any significant amount of time, then the appropriate people should be notified in order to fix it. Odds are if the email is down that many other important alerts are also not being received.

Monitor SQL Server Agent

SQL Server Agent jobs are all executed via the SQL Server Agent service. This is a Windows process that should be monitored and alerted on if it stops or becomes unresponsive. If SQL Server Agent is used for any monitoring, alerting, or data processing, then it should receive a similarly high level of priority with regards to ensuring that it is up and running at all times. If it goes down for any reason, then we’d benefit from immediate alerts to prevent downstream processing from failing, or worse, never running.

Consider other communication methods

While SQL Server relies heavily on email for alerting, you can integrate other monitoring tools that your organization uses into alert data in order to trigger texts, phone calls, or other types of communications.

For example, the database_mail_failure table could be monitored for new entries, and in addition to email, send a text to an on-call resource. Oftentimes, repeated failed messages will be indicative of a bigger problem that may be affecting a larger cross-section of an organization, whereas a single failure may simply be the result of a bad email address or email parameters.

Implement a secondary mail server

If email is a critical alerting component, then adding high availability to it can be a great way to avoid havoc if the mail server fails. Once a secondary server is available, you can create a second mail account and profile on your SQL Servers and automatically swap the default profile to it when needed.

Aggregation and summarization of alerts

If an organization manages many SQL Servers, there can be a benefit to concatenating messages across all servers and sending out a summary periodically. This strategy is more involved but has a beneficial side effect of offloading an alert email to another server. If the mail server used by this server is different from the failure’s originating server, then we have a somewhat roundabout way to avoid losing insight into failures.

That is a lot of work!

While these are all valid approaches, we realistically do not need to consider all of them. If email is monitored and maintained sufficiently, then the odds of a prolonged outage are low and most organizations are OK with that risk with the understanding that an operator can respond to and resolve a mail server issue quickly and efficiently if needed.

Still, addressing single-points-of-failure is important, and accepting that email is often a bottleneck for alerting and monitoring helps in avoiding long-term outages due to lack of knowledge of a mail server problem. We often consider no news as good news in the world of alerting, but no news can also mean that alerting processes are broken, and silence is the result.

Retrieving failed messages later

Our process leaves failed message details in the dbo.database_mail_failure table indefinitely. As a result, it’s possible to return to it at any time and review what is in there. If we wanted more flexibility in alerting, we could separate the alerting portion of failed message data collection into a separate process that reruns until successful.

Alternatively, if a failed message email fails, then another failed message alert will be generated. While not a fun situation, this would at least ensure that we do not completely lose insight into mail server problems.

Generally speaking, failed Database Mail messages should be rare, but if they happen very frequently on your servers, you may wish to add some cleanup into the database_mail_failure table. Simply add a DELETE statement at the end of the process that removes rows that have been alerted on, and where the error date is older than some acceptable amount of time.

Conclusion

Database Mail is a useful tool that allows emails to be generated from SQL Server quickly and efficiently. It also is a separate component of SQL Server, complete with its own error log and configuration.

This feature can fail for many reasons, including mail server outages, network configuration changes, invalid emails or parameters, or downstream problems in the mail server.

Setting up alerting against Database Mail, storing failed message data, and being able to quickly resend messages can save immense time and remove the chance that a process failure is resulting in an extended outage of email from SQL Server.

Downloads


Ed Pollack
Database mail

About Ed Pollack

Ed has 20 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This lead him to organize SQL Saturday Albany, which has become an annual event for New York’s Capital Region. In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate. View all posts by Ed Pollack

168 Views