Reporting and alerting on job failure in SQL Server

March 12, 2018 by Ed Pollack

SQL Server Agent can be used to run a wide variety of tasks within SQL Server. The built-in monitoring tools, though, are not well-suited for environments with many servers or many databases.

Removing reliance on default notifications and building our own processes can allow for greater flexibility, less alerting noise, and the ability to track failure conditions that are not typically tracked by SQL Server!

Introduction

At the heart of the SQL Server Agent service is the ability to create, schedule, and customize jobs. These jobs can be given schedules that determine at what times of day a task should execute. Jobs can also be started by triggers, such as a server restart or an alert. Finally, jobs can be called via TSQL from anywhere that has the appropriate access and permissions to SQL Server Agent.

The built-in notification system allows you to define operators and contact them when a job fails. While convenient, this does not scale well when large numbers of servers are involved, large numbers of jobs exist, or when more complex jobs are devised.

Customizing a job failure notification process is far simpler than it may seem, and in this article, we will walk through how to gather the necessary SQL Server Agent history data and use it to alert operators in far more meaningful ways than the default tools allow.

How Does SQL Server Agent Track Job Status?

SQL Server Agent maintains job, schedule, and execution details in tables within the MSDB database. The following tables provide what we will need to capture to track and alert on failures:

MSDB.dbo.sysjobs

This table contains one row per SQL Server Agent job defined on a given SQL Server instance.

The job_id is a UNIQUEIDENTIFIER that ensures a unique primary key for each job. This table also provides the name, description, create/modify dates, and a variety of other useful information about the job. This table can be queried to determine how many jobs exist on a server or to search based on a specific string in job names or descriptions. We could also search based on owner_sid to determine if any jobs are owned by the wrong login (such as a departing employee or job creator).
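As a rough sketch of the kinds of queries described above, the following could be run against msdb. The search string 'backup' is only a placeholder, and the column list is a small subset of what the table offers.

```sql
-- How many jobs exist on this instance, and how many of them are enabled?
SELECT
    COUNT(*) AS job_count,
    SUM(CAST(j.enabled AS INT)) AS enabled_job_count
FROM msdb.dbo.sysjobs AS j;

-- Search job names and descriptions for a string ('backup' is a placeholder).
SELECT
    j.job_id,
    j.name,
    j.description,
    j.date_created,
    j.date_modified
FROM msdb.dbo.sysjobs AS j
WHERE j.name LIKE '%backup%'
   OR j.description LIKE '%backup%';

-- List each job with its owner so that unexpected owners stand out.
SELECT
    j.name AS job_name,
    SUSER_SNAME(j.owner_sid) AS owner_login
FROM msdb.dbo.sysjobs AS j
ORDER BY owner_login, job_name;
```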

MSDB.dbo.syscategories

Within SQL Server Agent, you may select a category for each job. This allows for classification of reports, data collection processes, alerts, etc. When acting on jobs, we can take that category into account in order to increase the priority of some jobs or decrease the priority of others. For example, we could set some jobs as unmonitored and ignore them. Alternatively, we could set some jobs as critical and have them hit up the on-call operator’s cell phone the moment they fail.

This table is simple enough, and is mostly useful to pull the category name. The category class & type are hard-coded literals that determine where a category can be used:

  • Class: 1 = job, 2 = alert, 3 = operator.
  • Type: 1 = local, 2 = multiserver, 3 = none.
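To see which category each job belongs to, sysjobs can be joined to syscategories on category_id. This is a minimal sketch; how the category name is then used (for example, to skip jobs in an "unmonitored" category) is up to you.

```sql
-- List every job alongside its category name, class, and type.
SELECT
    j.name AS job_name,
    c.name AS category_name,
    c.category_class,   -- 1 = job, 2 = alert, 3 = operator
    c.category_type     -- 1 = local, 2 = multiserver, 3 = none
FROM msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.syscategories AS c
    ON c.category_id = j.category_id
ORDER BY c.name, j.name;
```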

MSDB.dbo.sysjobhistory

This table contains a row for each execution of a SQL Server Agent job or job step. Both step status and overall job status are contained in this table.

In addition to the corresponding job_id, step, and error information (if applicable), this table provides details on job runtime, run status, run date and time, and who was notified (if anyone).

Note that a row with step_id = 0 corresponds to the overall job status, and not that of any one step. It is possible for a job with no steps to fail, or for a job to fail prior to executing any job steps. As a result, we can end up with an overall job status with no corresponding step success/failure details.

Sysjobhistory is only populated when a job or job step completes. In the unlikely event that a server restarts abnormally or SQL Server Agent crashes, then it is possible for a job to not be completely logged in this table (or logged at all).
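A quick way to get a feel for this table is to look at the most recent history rows and label each one as a job outcome or a step outcome. The sketch below decodes run_status using the values documented for sysjobhistory; the TOP (50) limit is arbitrary.

```sql
-- Most recent history rows, with step_id = 0 marked as the overall job outcome.
SELECT TOP (50)
    j.name AS job_name,
    h.step_id,
    h.step_name,
    CASE WHEN h.step_id = 0 THEN 'Job outcome' ELSE 'Step outcome' END AS row_type,
    CASE h.run_status
        WHEN 0 THEN 'Failed'
        WHEN 1 THEN 'Succeeded'
        WHEN 2 THEN 'Retry'
        WHEN 3 THEN 'Canceled'
        WHEN 4 THEN 'In Progress'
        ELSE 'Unknown'
    END AS run_status_description,
    h.run_date,       -- integer, YYYYMMDD
    h.run_time,       -- integer, HHMMSS
    h.run_duration,   -- integer, HHMMSS
    h.message
FROM msdb.dbo.sysjobhistory AS h
INNER JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
ORDER BY h.instance_id DESC;
```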

Because this data exists and is readily accessible to us, we can collect, analyze, and use it for notification purposes.

Limitations of Built-In Notifications

By default, you can add notifications to any SQL Server Agent job, either via the GUI or TSQL.

Whether a job fails, succeeds, or completes, you may have it send an email, page an operator, or write to the event log when it finishes. While useful, this only addresses the state of a job when it completes, and not of any specific steps. In reality, we may care about the status of individual steps and wish to act on them on a more granular basis.
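For reference, a built-in email notification can be attached via TSQL with sp_update_job. This is a minimal sketch: the job name and operator name are placeholders, and the operator must already exist in SQL Server Agent.

```sql
-- Email an operator whenever the job fails.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Test Job',                    -- placeholder job name
    @notify_level_email = 2,                    -- 0 = never, 1 = on success, 2 = on failure, 3 = always
    @notify_email_operator_name = N'DBA Team';  -- hypothetical operator name
```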

A significant (and somewhat confusing) caveat to how SQL Server Agent handles job step failures is that you may configure a job to continue executing, even if a step fails. In the event that a job step fails, the job continues, and future steps succeed, the job will report success, despite one or more steps failing. This is not easy to alert on in SQL Server and common solutions involve either breaking a job into numerous smaller jobs or adding customized alerting steps as needed.

For our purposes, we’d like to build a failure notification job that is simple and all-encompassing. The following are all possible failure conditions that we may wish to report on:

  • A job fails due to one or more steps failing.
  • A job succeeds, but one or more steps fail and we wish to report on those failures.
  • All steps in a job succeed, but the job itself fails. This is often the result of a job configuration issue.
  • A job fails that has no steps defined for it.
  • A job fails prior to any steps executing. This is often the result of a job configuration issue.

To alert on all of these effectively, we will need to write some of our own code to detect, log, and report on them.

Another limitation of built-in notifications is the content of the alert that you receive. Selecting a notification as shown above will result in a single pre-fabricated notification whenever it is triggered. This is good for letting you know that a job failed, but the included information is often not enough to troubleshoot without going back to the job and reading more details of the failure. The following is an example of what the subject and body of a default SQL Server Agent notification would look like:

SQL Server Job System: ‘Test Job’ completed on EdSQLServer
JOB RUN: ‘Test Job’ was run on 3/5/2018 at 07:00:00 AM
DURATION: 0 hours, 0 minutes, 17 seconds
STATUS: Failed
MESSAGES: The job failed. The Job was invoked by Schedule 3 (Daily at 7am). The last step to run was step 1 (Run the test script!).

This is useful, but could be far more useful. For starters, the error message is only the job failure message and does not include the detailed failure info from any failed job steps. Ideally, enough details would be provided in the alert to ensure that you could respond immediately, and not need to dig further for error messages every single time this happens. Customizing an alert process allows us to include as much (or as little) detail as we wish in order to make the notifications we receive as actionable as possible!

Building A Better Notification System

To build as simple a job failure alert system as possible, we’ll follow a handful of steps to plan out and execute this project:

  1. Create tables to store job and job failure details.
  2. Create a stored procedure that logs recent job and failure details to these tables.
  3. Create a job that regularly calls this stored procedure.

A goal in this process is to keep things as basic as possible. There are many opportunities available here for overengineering a perpetual motion machine, but alerting on important failures is ideally simple so as to be as reliable as possible.

We’ll start by building a table that will store a list of SQL Server Agent jobs. Why build a table when MSDB already includes the sysjobs table? If a job is deleted, we want to retain the old job record for posterity. This allows us to report on failures for jobs that may have been recently deleted. It also allows us to retain information about past failures for jobs that no longer exist. Similarly, if we ever were to migrate this database to a new server or install a new version of SQL Server, then having the old job data will ensure that all of our job failure details will remain useful and not be associated with orphaned/unavailable job data in MSDB.

This table, in addition to keys, includes the job name, create/modify times, its category, and flags to indicate whether the job has been disabled or deleted. Feel free to add additional columns for any job metadata that is useful to you, but might not be listed here. We create a surrogate integer clustered primary key to avoid the need to have to index, join, and filter on the larger UNIQUEIDENTIFIER data type. If you are working with servers that have a very short and stable SQL Server Agent job list, then you could easily use a SMALLINT or even a TINYINT for the primary key ID column.
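A table along these lines might look like the following. Only the table name sql_server_agent_job comes from this article; the column names, data types, and the supporting index are assumptions made for illustration.

```sql
-- One row per SQL Server Agent job, retained even after the job is deleted.
CREATE TABLE dbo.sql_server_agent_job
(
    sql_server_agent_job_id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_sql_server_agent_job PRIMARY KEY CLUSTERED,  -- surrogate key
    sql_server_agent_job_id_guid UNIQUEIDENTIFIER NOT NULL,       -- msdb.dbo.sysjobs.job_id
    sql_server_agent_job_name NVARCHAR(128) NOT NULL,
    job_create_datetime_utc DATETIME NOT NULL,
    job_last_modified_datetime_utc DATETIME NOT NULL,
    is_enabled BIT NOT NULL,
    is_deleted BIT NOT NULL,
    job_category_name NVARCHAR(128) NOT NULL
);

-- Supports lookups by the Agent job_id when syncing from msdb.
CREATE NONCLUSTERED INDEX IX_sql_server_agent_job_id_guid
    ON dbo.sql_server_agent_job (sql_server_agent_job_id_guid);
```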

With a SQL Server Agent job table available, we can now build a table to store job failure metrics:

This table contains a foreign key back to our newly created job table above. It also contains details of the failed job, including the start/fail time of the job and details of the error. The last column is a flag that will be used to signify when a failed job has been alerted on successfully, so that we do not repeatedly spam an operator with alerts on it. Note that we will include both the job failure and the step failure messages, allowing for easier troubleshooting directly from the alert, without the need to dig back into SQL Server Agent prior to resolving the problem.
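A possible shape for this table is sketched below. The table name sql_server_agent_job_failure is from this article, but every column name and data type is an assumption, as is the name of the key column on sql_server_agent_job that the foreign key references.

```sql
-- One row per job failure or job step failure that we want to alert on.
CREATE TABLE dbo.sql_server_agent_job_failure
(
    sql_server_agent_job_failure_id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_sql_server_agent_job_failure PRIMARY KEY CLUSTERED,
    sql_server_agent_job_id INT NOT NULL
        CONSTRAINT FK_sql_server_agent_job_failure_job
        FOREIGN KEY REFERENCES dbo.sql_server_agent_job (sql_server_agent_job_id),
    sql_server_agent_instance_id INT NOT NULL,        -- msdb.dbo.sysjobhistory.instance_id
    job_start_time_utc DATETIME NOT NULL,
    job_failure_time_utc DATETIME NOT NULL,
    job_failure_step_number SMALLINT NOT NULL,        -- 0 when no step failure applies
    job_failure_step_name NVARCHAR(128) NOT NULL,
    job_failure_message NVARCHAR(MAX) NOT NULL,       -- overall job message
    job_step_failure_message NVARCHAR(MAX) NOT NULL,  -- message from the failed step
    has_email_been_sent_to_operator BIT NOT NULL      -- prevents repeat alerts
);
```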

The next step is to create a stored procedure that will check for job failures and place data into sql_server_agent_job_failure accordingly. This will be composed of a handful of steps:

  1. Update sql_server_agent_job with any new, deleted, or changed jobs.
  2. Collect data on new job failures.
  3. Collect data on new job step failures.
  4. Email an operator the details of these failures. Email can be replaced with another communication medium, if desired.

Here is the stored proc declaration. @minutes_to_monitor tells it how far back to check for job failures. This will be set depending on how often you plan on running the monitoring job that calls this proc. My preference is to pull one day’s worth of data. This ensures that in the event of server maintenance, an outage, or some other interruption, we won’t miss any job failures. We’ll filter out already-alerted-on failures as we go, so that won’t be a problem.

We also pull the UTC offset and will store all DATE/TIME data in UTC time. This will result in more math needed up front, but more consistency for anyone that views this data. UTC can be converted to local time by determining the offset and adding it to the UTC time.
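The offset and the monitoring window could be computed as follows. This is a sketch under a few assumptions: @minutes_to_monitor is declared as a variable here so the snippet runs on its own (inside the stored procedure it would be the parameter), and the other variable names are illustrative. Note that the offset is the server's current one, so history rows recorded on the other side of a daylight saving change would be converted with a slightly wrong offset.

```sql
DECLARE @minutes_to_monitor SMALLINT = 1440;  -- one day

-- Offset of local server time from UTC, in minutes (for example, -300 for UTC-5).
DECLARE @utc_offset_minutes INT = DATEPART(TZOFFSET, SYSDATETIMEOFFSET());

-- Earliest point in time to inspect, expressed in UTC.
DECLARE @monitor_start_utc DATETIME =
    DATEADD(MINUTE, -1 * @minutes_to_monitor, GETUTCDATE());

SELECT
    @utc_offset_minutes AS utc_offset_minutes,
    @monitor_start_utc AS monitor_start_utc,
    -- UTC to local: add the offset. Local to UTC: subtract it.
    DATEADD(MINUTE, @utc_offset_minutes, @monitor_start_utc) AS monitor_start_local;
```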

This TSQL performs a MERGE into our SQL Server Agent Job table. It matches on job_id, which is unique on each SQL Server and a reliable key for this purpose.
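A MERGE of this kind might be sketched as follows. The target column names are assumptions (they are not given in this article), and the rule of updating only when the modified date has advanced is one reasonable choice among several.

```sql
DECLARE @utc_offset_minutes INT = DATEPART(TZOFFSET, SYSDATETIMEOFFSET());

MERGE INTO dbo.sql_server_agent_job AS job_target
USING
(
    -- Current job list from msdb, with dates converted from local time to UTC.
    SELECT
        j.job_id,
        j.name AS job_name,
        DATEADD(MINUTE, -1 * @utc_offset_minutes, j.date_created) AS job_create_datetime_utc,
        DATEADD(MINUTE, -1 * @utc_offset_minutes, j.date_modified) AS job_last_modified_datetime_utc,
        CAST(j.enabled AS BIT) AS is_enabled,
        ISNULL(c.name, N'') AS job_category_name
    FROM msdb.dbo.sysjobs AS j
    LEFT JOIN msdb.dbo.syscategories AS c
        ON c.category_id = j.category_id
) AS job_source
    ON job_source.job_id = job_target.sql_server_agent_job_id_guid
WHEN NOT MATCHED BY TARGET THEN
    -- New job: add it to our table.
    INSERT (sql_server_agent_job_id_guid, sql_server_agent_job_name, job_create_datetime_utc,
            job_last_modified_datetime_utc, is_enabled, is_deleted, job_category_name)
    VALUES (job_source.job_id, job_source.job_name, job_source.job_create_datetime_utc,
            job_source.job_last_modified_datetime_utc, job_source.is_enabled, 0,
            job_source.job_category_name)
WHEN MATCHED AND job_source.job_last_modified_datetime_utc > job_target.job_last_modified_datetime_utc THEN
    -- Changed job: refresh the stored details.
    UPDATE SET
        sql_server_agent_job_name = job_source.job_name,
        job_last_modified_datetime_utc = job_source.job_last_modified_datetime_utc,
        is_enabled = job_source.is_enabled,
        is_deleted = 0,
        job_category_name = job_source.job_category_name;
```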

This additional query will check to see if any jobs no longer exist. That is, they are in sql_server_agent_job, but no longer in MSDB. We’ll flag any deleted jobs as disabled and deleted so that anyone reading this data knows that it is stored there for posterity and no longer references an active job.
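The deleted-job check could look something like this, again assuming illustrative column names (is_enabled, is_deleted, sql_server_agent_job_id_guid) on sql_server_agent_job.

```sql
-- Flag jobs that we have on record but that no longer exist in msdb.
UPDATE job
SET is_enabled = 0,
    is_deleted = 1
FROM dbo.sql_server_agent_job AS job
WHERE job.is_deleted = 0
  AND NOT EXISTS
  (
      SELECT 1
      FROM msdb.dbo.sysjobs AS j
      WHERE j.job_id = job.sql_server_agent_job_id_guid
  );
```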

A warning before we proceed: Dates, times, and statuses within many of the MSDB tables are stored using some outdated (read: scary) conventions. They are not stored as dates, times, datetimes, or even strings. Instead, they are stored as integers. For example, 8:41:53am is stored as 84153. Run durations are also stored as integers in HHMMSS form. For example, a job that ran for 00:12:53 (twelve minutes and fifty-three seconds) will be stored as 1253. Lastly, dates are stored as integers in the format YYYYMMDD. Converting this into more useful data types is critical to being able to meaningfully report on it.

As a result, we’re going to need some math and string manipulation to clean up dates and times that are stored as integer literals:

Much of this TSQL is devoted to converting numeric representations of dates and times into an actual DATETIME that we can compare against. The first CTE cleans up those integers so that the run time and run duration strings have leading zeroes to ensure they are 6 characters long. The second CTE converts these now uniform values into DATETIMEs using some very ugly string manipulation. The final SELECT converts those DATETIME values into UTC and places the results into a temporary table for use later in this stored proc.
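To make the conversion concrete, here is a small standalone sketch that uses the sample values from the warning above rather than reading from msdb. The CTE and column names are illustrative. One substitution to be aware of: the duration is converted with integer arithmetic instead of string padding, which also copes with durations of 100 hours or more.

```sql
WITH raw_history AS
(
    -- Sample values in the sysjobhistory format.
    SELECT s.run_date, s.run_time, s.run_duration
    FROM (VALUES (20180305, 84153, 1253)) AS s (run_date, run_time, run_duration)
),
padded_history AS
(
    SELECT
        CAST(r.run_date AS CHAR(8)) AS run_date_string,                        -- '20180305'
        RIGHT('000000' + CAST(r.run_time AS VARCHAR(6)), 6) AS run_time_string, -- '084153'
        -- HHMMSS integer to seconds: 1253 becomes 12 * 60 + 53 = 773.
        (r.run_duration / 10000) * 3600
            + ((r.run_duration / 100) % 100) * 60
            + (r.run_duration % 100) AS run_duration_seconds
    FROM raw_history AS r
),
converted_history AS
(
    SELECT
        -- '084153' becomes '08:41:53', then 'YYYYMMDD hh:mm:ss' is cast to DATETIME.
        CAST(p.run_date_string + ' '
             + STUFF(STUFF(p.run_time_string, 5, 0, ':'), 3, 0, ':') AS DATETIME) AS job_start_datetime,
        p.run_duration_seconds
    FROM padded_history AS p
)
SELECT
    c.job_start_datetime,                                                        -- 2018-03-05 08:41:53
    c.run_duration_seconds,                                                      -- 773
    DATEADD(SECOND, c.run_duration_seconds, c.job_start_datetime) AS job_end_datetime  -- 2018-03-05 08:54:46
FROM converted_history AS c;
```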

Our next step is to build a very similar query that will return data on job step failures. We intentionally do this work in a separate query as we will need to join these data sets together, and having them in separate temporary tables will make this significantly easier:

Note that the only significant difference between these queries is that we check for step_id = 0 when looking for overall job notification data and step_id > 0 for job steps that are associated with these jobs. Since sysjobhistory stores both job failures and job step failures, we will need to separate them from each other, and checking for step_id = 0 is a quick and easy way to do so.
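A condensed sketch of this staging step follows. It differs from the description above in one respect: failed history rows are converted once into a shared temporary table and then split on step_id, rather than being converted in two nearly identical queries. The temporary table and column names are assumptions.

```sql
DECLARE @minutes_to_monitor SMALLINT = 1440;
DECLARE @utc_offset_minutes INT = DATEPART(TZOFFSET, SYSDATETIMEOFFSET());

WITH failed_history AS
(
    SELECT
        h.instance_id, h.job_id, h.step_id, h.step_name, h.message, h.sql_severity,
        -- Integer run_date/run_time to a local DATETIME.
        CAST(CAST(h.run_date AS CHAR(8)) + ' '
             + STUFF(STUFF(RIGHT('000000' + CAST(h.run_time AS VARCHAR(6)), 6), 5, 0, ':'), 3, 0, ':')
             AS DATETIME) AS start_datetime_local,
        -- HHMMSS duration to seconds.
        (h.run_duration / 10000) * 3600
            + ((h.run_duration / 100) % 100) * 60
            + (h.run_duration % 100) AS duration_seconds
    FROM msdb.dbo.sysjobhistory AS h
    WHERE h.run_status = 0  -- 0 = failed
)
SELECT
    f.instance_id, f.job_id, f.step_id, f.step_name,
    f.message AS failure_message, f.sql_severity,
    DATEADD(MINUTE, -1 * @utc_offset_minutes, f.start_datetime_local) AS start_time_utc,
    DATEADD(SECOND, f.duration_seconds,
            DATEADD(MINUTE, -1 * @utc_offset_minutes, f.start_datetime_local)) AS end_time_utc
INTO #history_failure
FROM failed_history AS f
WHERE f.start_datetime_local > DATEADD(MINUTE, -1 * @minutes_to_monitor, GETDATE());

-- step_id = 0 rows describe the job as a whole; step_id > 0 rows describe individual steps.
SELECT * INTO #job_failure      FROM #history_failure WHERE step_id = 0;
SELECT * INTO #job_step_failure FROM #history_failure WHERE step_id > 0;
```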

Now that we have some data on job and step failures, we can begin generating failure data to insert into sql_server_agent_job_failure for the various failure scenarios that we identified earlier:

Jobs that Fail Due to Failed Steps

The CTE above will group job steps by job and job execution time, placing the last failure first. This allows us to determine which step failure was the direct cause of the job itself failing.
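The ranking idea can be illustrated with a self-contained sketch that uses table variables and made-up sample rows in place of the real staging tables. Matching a step to a job execution by checking that the step started within the job's start/end window is an assumption of this sketch, as are all names and sample values.

```sql
DECLARE @job_failure TABLE
    (job_id UNIQUEIDENTIFIER, job_start_utc DATETIME, job_end_utc DATETIME, failure_message NVARCHAR(4000));
DECLARE @job_step_failure TABLE
    (job_id UNIQUEIDENTIFIER, step_id INT, step_name NVARCHAR(128), step_start_utc DATETIME,
     failure_message NVARCHAR(4000));

DECLARE @sample_job UNIQUEIDENTIFIER = NEWID();

-- One failed job execution with two failed steps (sample data only).
INSERT INTO @job_failure VALUES
    (@sample_job, '20180305 12:00:00', '20180305 12:00:17', N'The job failed.');
INSERT INTO @job_step_failure VALUES
    (@sample_job, 1, N'Load data',   '20180305 12:00:01', N'Sample error from step 1.'),
    (@sample_job, 3, N'Send report', '20180305 12:00:10', N'Sample error from step 3.');

WITH ranked_step_failure AS
(
    SELECT
        jf.job_id, jf.job_start_utc, jf.job_end_utc,
        jf.failure_message AS job_failure_message,
        sf.step_id, sf.step_name,
        sf.failure_message AS step_failure_message,
        -- Rank failed steps within each job execution, latest failure first.
        ROW_NUMBER() OVER (PARTITION BY jf.job_id, jf.job_start_utc
                           ORDER BY sf.step_start_utc DESC, sf.step_id DESC) AS failure_rank
    FROM @job_failure AS jf
    INNER JOIN @job_step_failure AS sf
        ON sf.job_id = jf.job_id
       AND sf.step_start_utc BETWEEN jf.job_start_utc AND jf.job_end_utc
)
-- Keep only the last failed step per job execution (step 3 for the sample data).
SELECT job_id, job_start_utc, job_end_utc, step_id, step_name,
       job_failure_message, step_failure_message
FROM ranked_step_failure
WHERE failure_rank = 1;
```

In the real procedure, the final SELECT would feed an INSERT into sql_server_agent_job_failure, skipping any failures that have already been logged.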

Jobs that Failed Without Any Failed Steps

It is possible for a job to fail without any steps executing. This can be caused by a configuration error, a permissions problem, or some other high-level job or SQL Server Agent issue. We definitely want to know about these, so we check for all job failures that have no corresponding job step failures and insert them into sql_server_agent_job_failure.
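This check is essentially an anti-join. The sketch below uses table variables with a single made-up job failure and no step failures; the names and the time-window matching rule are assumptions.

```sql
DECLARE @job_failure TABLE
    (job_id UNIQUEIDENTIFIER, job_start_utc DATETIME, job_end_utc DATETIME, failure_message NVARCHAR(4000));
DECLARE @job_step_failure TABLE
    (job_id UNIQUEIDENTIFIER, step_id INT, step_name NVARCHAR(128), step_start_utc DATETIME,
     failure_message NVARCHAR(4000));

-- A job that failed before any step ran (sample data only).
INSERT INTO @job_failure VALUES
    (NEWID(), '20180305 07:00:00', '20180305 07:00:00', N'The job failed.');

-- Job failures with no failed step inside the job's execution window.
SELECT
    jf.job_id,
    jf.job_start_utc,
    jf.job_end_utc,
    0 AS failure_step_number,          -- no step to report
    N'' AS failure_step_name,
    jf.failure_message AS job_failure_message,
    N'' AS step_failure_message
FROM @job_failure AS jf
WHERE NOT EXISTS
(
    SELECT 1
    FROM @job_step_failure AS sf
    WHERE sf.job_id = jf.job_id
      AND sf.step_start_utc BETWEEN jf.job_start_utc AND jf.job_end_utc
);
```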

Job Steps that Fail, but for Jobs that Succeed

Depending on the logic built into job steps, we may allow the job to continue even when a step fails. If this is the case, then SQL Server will not report failure, assuming the remainder of the job succeeds or follows similar rules.

The TSQL above will check for all failed job steps that do not have a corresponding failed job and report on them as well.
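This is the mirror image of the previous check. In the sketch below, a made-up failed step exists but there is no failed job execution covering it, so the step is reported on its own. As before, the table variables, names, and matching rule are assumptions.

```sql
DECLARE @job_failure TABLE
    (job_id UNIQUEIDENTIFIER, job_start_utc DATETIME, job_end_utc DATETIME, failure_message NVARCHAR(4000));
DECLARE @job_step_failure TABLE
    (job_id UNIQUEIDENTIFIER, step_id INT, step_name NVARCHAR(128), step_start_utc DATETIME,
     failure_message NVARCHAR(4000));

-- A failed step in a job that went on to succeed (sample data only).
INSERT INTO @job_step_failure VALUES
    (NEWID(), 2, N'Archive files', '20180305 07:00:05', N'Sample error from step 2.');

-- Failed steps with no failed job execution covering them.
SELECT
    sf.job_id,
    sf.step_start_utc,
    sf.step_id AS failure_step_number,
    sf.step_name AS failure_step_name,
    N'' AS job_failure_message,        -- the job itself reported success
    sf.failure_message AS step_failure_message
FROM @job_step_failure AS sf
WHERE NOT EXISTS
(
    SELECT 1
    FROM @job_failure AS jf
    WHERE jf.job_id = sf.job_id
      AND sf.step_start_utc BETWEEN jf.job_start_utc AND jf.job_end_utc
);
```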

Notification

For this demo, we’ll use sp_send_dbmail to email a notification to an operator. If this is not your preferred method of alerting, feel free to substitute this with something else.

Some effort is taken here to structure the job failure details into a formatted HTML email with the failure count at the top and a table of failure data below. This makes reading the emails relatively easy and straightforward. The following is an example of a job failure that was sent using this process:

These failures were staged, with the first being a scenario where a step fails, but the job succeeds and the second when a job fails due to a failed step. This email format consolidates all failures into a single table that includes details on what failed and why. While only 6 columns are included in this table, you can add more for any additional data that could be useful, such as job duration, number of retries, or further details about the job itself.

The goal of an alert such as this is to reduce the amount of homework that you need to do whenever something breaks. Instead of having to return to SQL Server Agent and read all of the information that could be presented here, you can begin troubleshooting right away.
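The notification step could be sketched along these lines. It assumes Database Mail is already configured; the mail profile, recipient address, column choices, and sample row are all placeholders. The HTML table is built with FOR XML PATH, which also escapes any special characters in the error messages.

```sql
DECLARE @failure TABLE
(
    server_name NVARCHAR(128), job_name NVARCHAR(128), failure_time_utc DATETIME,
    step_name NVARCHAR(128), job_message NVARCHAR(4000), step_message NVARCHAR(4000)
);

-- Sample row standing in for unsent rows from sql_server_agent_job_failure.
INSERT INTO @failure VALUES
    (ISNULL(@@SERVERNAME, N'unknown'), N'Test Job', '20180305 12:00:17',
     N'Run the test script!', N'The job failed.', N'Sample step error text.');

DECLARE @failure_count INT = (SELECT COUNT(*) FROM @failure);

IF @failure_count > 0
BEGIN
    DECLARE @email_subject NVARCHAR(255) =
        N'Failed job alert: ' + ISNULL(@@SERVERNAME, N'unknown');

    DECLARE @email_body NVARCHAR(MAX) =
        N'<html><body><p>Failures detected: ' + CAST(@failure_count AS NVARCHAR(10)) + N'</p>'
        + N'<table border="1"><tr><th>Server</th><th>Job</th><th>Failure Time (UTC)</th>'
        + N'<th>Step</th><th>Job Message</th><th>Step Message</th></tr>'
        + CAST((
              -- Each row becomes <tr><td>...</td>...</tr>.
              SELECT
                  td = f.server_name, '',
                  td = f.job_name, '',
                  td = CONVERT(VARCHAR(19), f.failure_time_utc, 120), '',
                  td = f.step_name, '',
                  td = f.job_message, '',
                  td = f.step_message
              FROM @failure AS f
              ORDER BY f.failure_time_utc
              FOR XML PATH('tr'), TYPE
          ) AS NVARCHAR(MAX))
        + N'</table></body></html>';

    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'Default Mail Profile',  -- hypothetical Database Mail profile
        @recipients = N'dba-team@example.com',    -- placeholder address
        @subject = @email_subject,
        @body = @email_body,
        @body_format = 'HTML';

    -- In the real procedure, the rows just sent would now be flagged as sent
    -- so that they are not alerted on again.
END;
```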

SQL Server Agent Job

The stored procedure that was created above can be called from anywhere (scheduled task, PowerShell, SQLCMD, etc.), but for simplicity, I’ll use a SQL Server Agent job:

This job executes every 5 minutes and sticks to the default of 1440 minutes (one day) for the monitoring interval. Any failures in sql_server_agent_job_failure that have not been flagged as sent to the operator will be sent out as part of any given run. Feel free to adjust the job run frequency to whatever meets your needs.

If a job is deemed mission critical and you need to know of its failure immediately, then you may wish to consider a separate notification on it if waiting up to 5 minutes is too long, or have this process run more frequently. Most processes are flexible and can allow for a short waiting period before we respond, though that priority is determined by you. For example, if a SQL Server instance restarts, we’d probably want to know right now, but if an overnight reporting process fails, waiting a few minutes is probably fine.

Note that I have included a built-in notification on this job that will email me if it fails. This is intentional and answers the question of, “What notifies us of a failure if the job failure notification process breaks?” The only larger issue than this would be if SQL Server Agent were to become unavailable. Alerting on this is beyond the scope of this article, but could be accomplished by a service monitor pointed at SQL Server Agent.
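For those who prefer TSQL to the GUI, a job like this could be created roughly as follows. The job, schedule, operator, database, and stored procedure names are all hypothetical; substitute whatever you used when creating the procedure.

```sql
-- Create the job, with a built-in email notification if this job itself fails.
EXEC msdb.dbo.sp_add_job
    @job_name = N'Monitor Job Failures',
    @enabled = 1,
    @description = N'Logs and alerts on SQL Server Agent job and step failures.',
    @notify_level_email = 2,                     -- 2 = on failure
    @notify_email_operator_name = N'DBA Team';   -- hypothetical operator; must already exist

-- Single step that calls the monitoring procedure.
EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'Monitor Job Failures',
    @step_name = N'Check for job failures',
    @subsystem = N'TSQL',
    @database_name = N'DBA',                     -- hypothetical database holding the procedure
    @command = N'EXEC dbo.monitor_job_failures @minutes_to_monitor = 1440;';

-- Run every 5 minutes, every day.
EXEC msdb.dbo.sp_add_schedule
    @schedule_name = N'Every 5 Minutes',
    @freq_type = 4,              -- daily
    @freq_interval = 1,          -- every 1 day
    @freq_subday_type = 4,       -- subday unit: minutes
    @freq_subday_interval = 5,   -- every 5 minutes
    @active_start_time = 0;      -- starting at midnight

EXEC msdb.dbo.sp_attach_schedule
    @job_name = N'Monitor Job Failures',
    @schedule_name = N'Every 5 Minutes';

-- Target the local server so that the job will actually run.
EXEC msdb.dbo.sp_add_jobserver
    @job_name = N'Monitor Job Failures',
    @server_name = N'(local)';
```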

Cleanup

Once in place and tested, this alerting system can take the place of any existing alerts. The benefits of this process are:

  1. Failures are stored in tables that can be queried and reported on later, if need be.
  2. Failures are grouped into individual notifications. This prevents a flood of alerts if a job fails frequently.
  3. Additional details are provided that built-in notifications do not report on.
  4. All failure conditions can be reported on, including unusual ones that SQL Server Agent may not catch.

Conclusion

Ultimately, reporting and alerting on failures are done on a case-by-case basis. This process was written to be as simple as possible, and therefore can be customized until it fully meets your needs. As an added bonus, we can create a well-designed schema that relies on easy-to-understand data types for dates, times, statuses, and duration.

Creating your own job failure alerting process allows you to take charge of alerting and make the resulting notifications as meaningful and useful as possible. This is important as a major goal of alerting is to make the messages we get as actionable and informative as possible, without producing noise or distractions. We also do not want to miss potentially important failures that result from a misconfigured job.

Given the limitations of the built-in alerting options in SQL Server, this also provides us functionality that is not possible otherwise. We can adjust notifications to react to failure states such as failed steps or misconfigured jobs. We can also customize the notifications we receive to include additional information that allows us to jump straight into troubleshooting, without the need to revisit SQL Server Agent and collect more troubleshooting data.

Effective alerting improves our productivity and decreases distractions. Most importantly, it reduces late-night wake-up calls, which is something we can all get behind!


About Ed Pollack

Ed has 20 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQL Saturday Albany, which has become an annual event for New York’s Capital Region. In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate.
