SQL Server Job Performance – Tracking

January 19, 2017 by Ed Pollack

Description

Keeping track of our SQL Server Agent jobs is a very effective way to control scheduling, catch failures, and understand when undesired conditions may be manifesting themselves. This is a brief journey towards a solution that provides far greater insight into our jobs, schedules, and executions than we can glean from SQL Server by default.

Introduction

When we consider SQL Server performance metrics for use by database professionals, a long list comes to mind:

  • Waits
  • CPU
  • Memory utilization (buffer cache, plan cache, server memory)
  • Contention (locks, deadlocks, blocking)
  • I/O (reads, writes by database/file/object)
  • Many, many more…

A metric often overlooked is job performance. The alert we most often respond to is a failed job. When a backup, warehouse load, or maintenance task fails, we are alerted and know that some response is needed to ensure we understand why and how it failed.

Job failures are not always avoidable, but our goal is to prevent systemic or catastrophic failures. If we discover that an important ETL process is taking 23 hours to complete per day, then it is obvious we need to revisit the job, the work it performs, and rearchitect it to be more efficient. At that time, though, it is too late. We need to keep our important data flowing, but at the same time ensure that the long-running process doesn’t slow to the point of being ineffective, or fail.

Our goal is to take job-related metadata and run statistics from MSDB and compile them into a simple but useful store of reporting data that can be trended, monitored, and reported on as needed.

What Are the Benefits of Job Performance Tracking?

The immediate benefits of tracking this sort of data are obvious in that we can keep an eye out for very long-running jobs and (hopefully) catch a problematic situation before it becomes a 2 am wake-up call!

There are additional insights we can glean from this data that can make it even more useful, such as:

  • Locate short-running jobs that finish unusually fast, indicating that they are not performing the expected workload. For example, if the full database backup of a 10TB database completes in 5 minutes, we know something is very wrong, especially if the average runtime of this job is typically 2 hours!

  • Trend job runtime and predict how long jobs will take in the future. This allows us to predict when we may need to evaluate the efficiency of current processes before they become problematic.
  • Track individual job steps in order to determine which parts of larger jobs might need attention.
  • Aggregate job performance across a sharded or multi-server environment to determine overall efficiency.
  • Improve our ability to schedule new jobs, by being able to easily view job scheduling data on a server.
  • Allow us to better plan server downtime by reviewing job scheduling and runtime data. This provides us with information on what jobs could be missed (or need to be run later) in the event that a server is brought down. This could also be useful after an unexpected outage, allowing us to review any missed or failed jobs en masse.

With some creativity, other uses can be concocted for this data, including tracking SSIS package execution, replication performance, backups, index maintenance, ETL processes, warehouse data loads, and more!

Why Doesn’t Everyone Do This?

As with many alerting, monitoring, or trending processes, we often don’t see a need for them until something breaks. Until a job or related process fails, we may not realize that runtimes are worth paying attention to.

To make matters more difficult, SQL Server does not provide a spectacular way to monitor and trend job progress and results over time. SQL Server Agent contains a GUI that can show us details on recent job outcomes, as well as the ability to monitor jobs in progress.

From the SQL Server Agent menu, we can click on Job Activity Monitor in order to view the up-to-the-minute status of each job. From the Jobs menu, the history for all or any subset of jobs can be viewed, as well as for any single job.

This information is useful, but its utility is limited in a number of ways:

  • Job history is stored in MSDB for a limited amount of time, based on a server’s configuration. Metrics about previous job runs may be removed once past a set retention period.
  • There are few ways to trend or view any metrics over a period of time.
  • Viewing the combined job history for multiple servers is not intuitive.
  • We have little ability to customize how this data is presented to us.

An alternative to navigating job history data via the GUI is to use a handful of views within MSDB, each of which provides data on job configuration, job schedules, and job history:

  • MSDB.dbo.sysjobs: Provides a list of all jobs on the local SQL Server, along with relevant metadata.
  • MSDB.dbo.sysschedules: Contains all job schedules defined on the local server, along with their details.
  • MSDB.dbo.sysjobhistory: Contains a row for each job and job step that is executed on the local server.
  • MSDB.dbo.sysjobschedules: Contains a row per job/schedule relationship.
  • MSDB.dbo.sysjobactivity: Contains details on recent job activity, including accurate future runtimes.

These views provide valuable information but are not easy to read and consume. While Microsoft provides documentation on their purpose and contents, many of the data types are suboptimal, and the mixed contents of sysjobhistory are not intuitive.

For example, dates and times are stored as integers. 03:15:05 is stored as 31505 and 9/23/2016 is 20160923. Job run duration is stored as an integer in the format HHMMSS, such that 85 seconds would appear as 125 (one minute and twenty-five seconds). Job schedules are stored using numeric constants to represent how often they run, run time, and run intervals.

Given the unintuitive interface for pulling job history data, it’s often seen as not worth the time to pull, convert, and present the data unless absolutely needed. Alternatively, enterprise job scheduler software can be purchased that will manage SQL Server Agent jobs. While convenient, software costs money and not all companies can or want to spend resources on this sort of software.

Building a Job Performance Tracking Solution

Given the constraints presented thus far, we can choose to build our own solution for collecting and storing job performance metrics. Once collected, reporting on these metrics becomes a simple matter of defining what we want to see and writing some relatively simple queries to crunch our data appropriately.

The steps we will follow to create a self-sufficient solution to the problem presented above are:

  1. Create new tables that will store job performance metrics.
  2. Create a stored procedure to collect job metrics and store them in the tables above.
  3. Clean up old data, as prescribed by whatever retention rules we deem necessary.
  4. Schedule a job to regularly collect our job performance data.
  5. Create reports that consume this data and return useful results (see the next section).

As we build this solution, feel free to consider ways of customizing it. Add columns or tables that matter to you, remove those that don’t, and adjust processes to fit the data you need. The beauty of any hand-built process is the ability to have complete control over customization and implementation!

Tables

We’ll create four tables to store data pertinent to our metrics. Two will be dimension tables that include basic information about jobs and schedules. The other two will contain job and job step execution results. These tables are structured like warehouse tables in order to facilitate easier consumption by reporting products or processes, though you are free to name & structure them based on whatever standards you typically follow.
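
The first is the job dimension. A minimal sketch of what it might look like follows; the table and column names here are assumptions chosen for illustration, not a definitive design:

CREATE TABLE dbo.Dim_SQL_Agent_Job
(
    Dim_SQL_Agent_Job_Id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Dim_SQL_Agent_Job PRIMARY KEY CLUSTERED,
    Sql_Agent_Job_Id UNIQUEIDENTIFIER NOT NULL, -- job_id from MSDB.dbo.sysjobs
    Sql_Agent_Job_Name NVARCHAR(128) NOT NULL,
    Job_Create_Datetime DATETIME NOT NULL,
    Job_Last_Modified_Datetime DATETIME NOT NULL,
    Is_Enabled BIT NOT NULL,
    Is_Deleted BIT NOT NULL
);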

This table stores a row per Agent job, with Sql_Agent_Job_Id being pulled directly from MSDB.dbo.sysjobs. Since this is a GUID, we create a surrogate key to serve as the clustered primary key, which helps improve performance when writing to this table.

Persisting this data allows for retention of job data, even if a job is disabled or deleted on the server. This can be useful for understanding how a job previously performed, or how a new and old version of a process compare.
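
A similar sketch for the schedule dimension, again with assumed column names; the readable occurrence and frequency strings will be populated by the collection procedure later on:

CREATE TABLE dbo.Dim_SQL_Agent_Schedule
(
    Schedule_Id INT NOT NULL
        CONSTRAINT PK_Dim_SQL_Agent_Schedule PRIMARY KEY CLUSTERED, -- schedule_id from MSDB.dbo.sysschedules
    Schedule_Name NVARCHAR(128) NOT NULL,
    Schedule_Occurrence NVARCHAR(128) NULL,        -- e.g. "Daily", "Weekly"
    Schedule_Occurrence_Detail NVARCHAR(256) NULL, -- e.g. days of the week
    Schedule_Frequency NVARCHAR(256) NULL,         -- e.g. "Every 15 Minute(s)"
    Schedule_Start_Datetime DATETIME NULL,
    Schedule_End_Datetime DATETIME NULL,
    Schedule_Created_Datetime DATETIME NOT NULL,
    Schedule_Last_Modified_Datetime DATETIME NOT NULL,
    Is_Enabled BIT NOT NULL,
    Is_Deleted BIT NOT NULL
);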

Similar to the previous dimension table, this one stores a row per schedule, as found in MSDB.dbo.sysschedules, with schedule_id being the unique identifier per schedule. This data is somewhat optional in that we do not need schedule details in order to understand when jobs run, for how long, and their results, but the added information is useful for understanding what schedules are used, and for what jobs.

If desired, we could also create a linking table that illustrates the relationships between jobs and schedules, allowing us to understand when jobs are supposed to run, which schedules they run under, and predict future job executions. This information is also not needed in order to fully comprehend job history but could be useful for predicting the best times for planned maintenance, outages, or to effectively schedule new jobs. Data for this task can be pulled from MSDB.dbo.sysjobschedules, a linking table that contains a single row per job-schedule pairing (a one-to-many relationship). We can easily create a small table to store and maintain this data:
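
One possible shape for this linking table, keeping the column names consistent with the sketches above:

CREATE TABLE dbo.Fact_Job_Schedule
(
    Fact_Job_Schedule_Id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Fact_Job_Schedule PRIMARY KEY CLUSTERED,
    Sql_Agent_Job_Id UNIQUEIDENTIFIER NOT NULL,
    Schedule_Id INT NOT NULL,
    Next_Run_Datetime DATETIME NULL
);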

A surrogate key is used as a more reliable clustered index, but a combination of job_id and schedule_id would also work. Next_Run_Datetime is optional, but could be handy under some circumstances. Note that the accuracy of this column will be based on how frequently this data is updated. If the job runs more often than the collection of our job performance data, then this column will not always be up-to-date.

Now we can introduce the table where job run statistics will be stored. Each row represents a single job run for a given job_id, when it started, its duration, and its completion status. The end time is a computed column, as we can determine it easily once the start time and duration are known. The status will contain a friendly string indicating the job result: Failure, Success, Retry, or Canceled. Note that the schedule that triggered the job is not referenced here. As a result, schedule data is not required for this to work but is nice to have in general.
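
A sketch of that fact table, with the end time implemented as a computed column (names are assumptions for illustration):

CREATE TABLE dbo.Fact_Job_Run_Time
(
    Fact_Job_Run_Time_Id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Fact_Job_Run_Time PRIMARY KEY CLUSTERED,
    Sql_Agent_Job_Id UNIQUEIDENTIFIER NOT NULL,
    Job_Start_Datetime DATETIME NOT NULL,
    Job_Duration_Seconds INT NOT NULL,
    Job_End_Datetime AS DATEADD(SECOND, Job_Duration_Seconds, Job_Start_Datetime),
    Job_Status VARCHAR(16) NOT NULL -- Failure, Success, Retry, or Canceled
);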

This table is similar to our last one, and stores the metrics for individual job steps. Completed job data is omitted from this data as it is stored in the other table. In addition, steps in progress are not recorded—only those that have completed. Depending on your use-case, there could be value in combining job and job step data into a single table, much like how MSDB.dbo.sysjobhistory stores it. Here, we choose to separate them as we may not always want individual step data, and having to carve this out from a larger data set could be a nuisance in terms of report/script development and performance.
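
A matching sketch for the job step fact table, following the same assumed conventions:

CREATE TABLE dbo.Fact_Job_Step_Run_Time
(
    Fact_Job_Step_Run_Time_Id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Fact_Job_Step_Run_Time PRIMARY KEY CLUSTERED,
    Sql_Agent_Job_Id UNIQUEIDENTIFIER NOT NULL,
    Job_Step_Id INT NOT NULL,
    Job_Step_Start_Datetime DATETIME NOT NULL,
    Job_Step_Duration_Seconds INT NOT NULL,
    Job_Step_End_Datetime AS DATEADD(SECOND, Job_Step_Duration_Seconds, Job_Step_Start_Datetime),
    Job_Step_Status VARCHAR(16) NOT NULL
);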

Metrics Collection Stored Procedure

Now that we have built a number of custom tables to store our job performance data, the next step is to create a process that pulls the data from MSDB, transforms it into a more user-friendly form, and stores it in those tables. We will manage the gathering of data for each table separately, allowing for easier customization and testing of our code.
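
A sketch of the procedure shell; the name used here is an assumption, and the collection logic discussed below would live inside it:

CREATE PROCEDURE dbo.usp_collect_job_performance_metrics
AS
BEGIN
    SET NOCOUNT ON;

    -- Collection logic follows: schedules, jobs, job/schedule
    -- associations, job run times, and job step run times.
END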

The name is arbitrary, but this seems descriptive enough. No parameters are used by the stored procedure as I intentionally want to keep this as simple as possible. If you have a need to pass in additional configuration or timing options, doing so should be relatively easy.

We’ll collect job schedules first, as this data is relatively small & simple. The TSQL below finds the date/time of the last change in our data set so that we only collect modifications that have occurred since then. If no data exists, we set our last modify date/time to a very old date in order to ensure we collect everything on the first run.
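
A minimal sketch of that step, using the Dim_SQL_Agent_Schedule table assumed earlier:

DECLARE @Last_Schedule_Modified_Datetime DATETIME;

SELECT
    @Last_Schedule_Modified_Datetime = ISNULL(MAX(Schedule_Last_Modified_Datetime), '1/1/1900')
FROM dbo.Dim_SQL_Agent_Schedule;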

With this housekeeping out of the way, we can proceed to collect all schedule data from the local SQL Server. A MERGE statement is used for convenience, as it allows us to insert and update rows appropriately all at once. While this TSQL looks long, its messiness is primarily due to the need to convert integer identifiers in MSDB.dbo.sysschedules into more readable data types. These conversions are not pretty, and there are many ways to accomplish them, but by fixing our data now, we make using it later much, much easier.

Essentially, this code collects all schedule data modified since our last modified date/time and either inserts rows for new schedules or updates those that already exist. All of the CASE statements are used to convert the schedule frequency, intervals, and recurrence into strings that can be easily consumed by reports or metrics collection processes. The integers are easy for system processing, but not easy for humans to consume. Our primary goal here is to create a simple data store that is easy to use by anyone, even someone not terribly familiar with the underlying MSDB data.
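
The full MERGE is too long to reproduce here, but the following fragment of its source query illustrates the kind of conversion involved, translating sysschedules.freq_type into a readable label and reusing the variable declared above (a sketch, not the complete statement):

SELECT
    s.schedule_id,
    s.name AS Schedule_Name,
    CASE s.freq_type
        WHEN 1   THEN 'One-Time'
        WHEN 4   THEN 'Daily'
        WHEN 8   THEN 'Weekly'
        WHEN 16  THEN 'Monthly'
        WHEN 32  THEN 'Monthly (Relative)'
        WHEN 64  THEN 'When SQL Server Agent Starts'
        WHEN 128 THEN 'When Computer Is Idle'
    END AS Schedule_Occurrence
FROM msdb.dbo.sysschedules AS s
WHERE s.date_modified > @Last_Schedule_Modified_Datetime;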

An additional segment of code can be added to manage whether a schedule is deleted or not. This is optional, but a nice way to maintain some schedule history on those that had previously been used, but have been deleted:
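
A sketch of that cleanup, again using the assumed dimension table:

UPDATE dbo.Dim_SQL_Agent_Schedule
    SET Is_Deleted = 1,
        Is_Enabled = 0
WHERE Is_Deleted = 0
AND NOT EXISTS
    (SELECT 1
     FROM msdb.dbo.sysschedules
     WHERE sysschedules.schedule_id = Dim_SQL_Agent_Schedule.Schedule_Id);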

For any schedule not found on the server, the Is_Enabled and Is_Deleted bits are set appropriately.

With schedule data in hand, we can now pull job data, allowing us to maintain a full list of all SQL Server Agent jobs. Comparatively, job data is far simpler than schedules, as there is no date/time data or other temporal information encoded in suboptimal data types. To start off, we’ll collect the most recent job data update from our data set:
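
A sketch of that step, mirroring what was done for schedules:

DECLARE @Last_Job_Modified_Datetime DATETIME;

SELECT
    @Last_Job_Modified_Datetime = ISNULL(MAX(Job_Last_Modified_Datetime), '1/1/1900')
FROM dbo.Dim_SQL_Agent_Job;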

As with schedules, this cuts down the data set we collect from MSDB to only include jobs that were created or updated since the last collection time. With that complete, we can use a MERGE statement to pull in jobs data from MSDB.dbo.sysjobs.
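
A sketch of that MERGE, targeting the Dim_SQL_Agent_Job table assumed earlier:

MERGE INTO dbo.Dim_SQL_Agent_Job AS TARGET
USING (
    SELECT
        sysjobs.job_id,
        sysjobs.name,
        sysjobs.enabled,
        sysjobs.date_created,
        sysjobs.date_modified
    FROM msdb.dbo.sysjobs
    WHERE sysjobs.date_modified > @Last_Job_Modified_Datetime
) AS SOURCE
ON (SOURCE.job_id = TARGET.Sql_Agent_Job_Id)
WHEN MATCHED THEN
    UPDATE SET
        Sql_Agent_Job_Name = SOURCE.name,
        Job_Create_Datetime = SOURCE.date_created,
        Job_Last_Modified_Datetime = SOURCE.date_modified,
        Is_Enabled = SOURCE.enabled,
        Is_Deleted = 0
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Sql_Agent_Job_Id, Sql_Agent_Job_Name, Job_Create_Datetime,
            Job_Last_Modified_Datetime, Is_Enabled, Is_Deleted)
    VALUES (SOURCE.job_id, SOURCE.name, SOURCE.date_created,
            SOURCE.date_modified, SOURCE.enabled, 0);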

When a job is not found in our data, it is inserted, and when it is found, it is updated. Lastly, we check for deleted jobs and update as appropriate, similar to with schedules:
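
A sketch of the deleted-job check:

UPDATE dbo.Dim_SQL_Agent_Job
    SET Is_Deleted = 1,
        Is_Enabled = 0
WHERE Is_Deleted = 0
AND NOT EXISTS
    (SELECT 1
     FROM msdb.dbo.sysjobs
     WHERE sysjobs.job_id = Dim_SQL_Agent_Job.Sql_Agent_Job_Id);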

Job associations can be tracked by pulling directly from MSDB.dbo.sysjobschedules, but the next run date is not updated constantly. There is a delay in SQL Server before a background process runs and updates this data. If you need up-to-the-minute accuracy, then use MSDB.dbo.sysjobactivity, which is updated as jobs execute and complete. As a bonus, the date/time columns are actually stored in DATETIME data types!

The following script pulls the assignment details first, and then joins back to pull the next run date/time:
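
A sketch of this step, targeting the Fact_Job_Schedule table assumed earlier and pulling the next run time from sysjobactivity; production code might restrict the activity lookup to the most recent Agent session, while this sketch simply takes the latest value per job:

MERGE INTO dbo.Fact_Job_Schedule AS TARGET
USING (
    SELECT
        sysjobschedules.job_id,
        sysjobschedules.schedule_id,
        JOB_ACTIVITY.next_scheduled_run_date
    FROM msdb.dbo.sysjobschedules
    LEFT JOIN
    (   SELECT
            sysjobactivity.job_id,
            MAX(sysjobactivity.next_scheduled_run_date) AS next_scheduled_run_date
        FROM msdb.dbo.sysjobactivity
        GROUP BY sysjobactivity.job_id
    ) AS JOB_ACTIVITY
    ON JOB_ACTIVITY.job_id = sysjobschedules.job_id
) AS SOURCE
ON (SOURCE.job_id = TARGET.Sql_Agent_Job_Id
    AND SOURCE.schedule_id = TARGET.Schedule_Id)
WHEN MATCHED THEN
    UPDATE SET Next_Run_Datetime = SOURCE.next_scheduled_run_date
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Sql_Agent_Job_Id, Schedule_Id, Next_Run_Datetime)
    VALUES (SOURCE.job_id, SOURCE.schedule_id, SOURCE.next_scheduled_run_date)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;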

The primary difference in this MERGE statement is that we will delete any associations that do not exist as there is little benefit in maintaining old relationships. The next run time is the same for all assigned schedules, in case more than one exists on a job. This keeps our data as simple as possible for future use.

The next two blocks of TSQL comprise the bulk of our stored procedure and encompass the actual job duration metrics collection.

A common table expression (CTE) is used to reformat the date/time/duration data, and then the subsequent INSERT INTO…SELECT statement finishes the formatting and places the data into our tables.

To put the data types used in MSDB.dbo.sysjobhistory into perspective:

  • Run_status is an integer indicating the job outcome: 0 (failure), 1 (success), 2 (retry), or 3 (canceled).
  • Run_date is an integer representation of the YYYYMMDD date of the job execution.
  • Run_time is an integer representation of the HHMMSS time of the job execution.
  • Run_duration is the amount of time the job ran for, formatted as HHMMSS.

Since integers are used for dates, times, and duration, the number of digits will vary depending on the time of day. For example, a run_time of 20001 indicates 2:00:01 am; the leading zero is simply dropped. This added complexity necessitates some gnarly string manipulation in order to be certain that the resulting values are valid DATETIMEs, rather than INT or VARCHAR representations.
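
A sketch of the job-level collection, using the assumed Fact_Job_Run_Time table and a simple high-water mark on the start time to avoid re-inserting rows that have already been collected:

WITH CTE_JOB_RUNS AS (
    SELECT
        sysjobhistory.job_id,
        -- Rebuild a DATETIME from the integer run_date (YYYYMMDD) and run_time (HHMMSS).
        CONVERT(DATETIME, CAST(sysjobhistory.run_date AS CHAR(8)), 112)
            + CAST(STUFF(STUFF(RIGHT('000000' + CAST(sysjobhistory.run_time AS VARCHAR(6)), 6),
                   5, 0, ':'), 3, 0, ':') AS DATETIME) AS Job_Start_Datetime,
        -- Convert the HHMMSS-encoded run_duration into seconds.
        (sysjobhistory.run_duration / 10000) * 3600
            + ((sysjobhistory.run_duration % 10000) / 100) * 60
            + (sysjobhistory.run_duration % 100) AS Job_Duration_Seconds,
        CASE sysjobhistory.run_status
            WHEN 0 THEN 'Failure'
            WHEN 1 THEN 'Success'
            WHEN 2 THEN 'Retry'
            WHEN 3 THEN 'Canceled'
        END AS Job_Status
    FROM msdb.dbo.sysjobhistory
    WHERE sysjobhistory.step_id = 0 -- Step 0 = the overall job outcome.
)
INSERT INTO dbo.Fact_Job_Run_Time
    (Sql_Agent_Job_Id, Job_Start_Datetime, Job_Duration_Seconds, Job_Status)
SELECT
    CTE_JOB_RUNS.job_id,
    CTE_JOB_RUNS.Job_Start_Datetime,
    CTE_JOB_RUNS.Job_Duration_Seconds,
    CTE_JOB_RUNS.Job_Status
FROM CTE_JOB_RUNS
WHERE CTE_JOB_RUNS.Job_Start_Datetime >
    ISNULL((SELECT MAX(Job_Start_Datetime) FROM dbo.Fact_Job_Run_Time), '1/1/1900');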

This TSQL pulls exclusively those rows with step_id = 0, which indicates overall job completion. Rows with a step_id of 1 or higher correspond to the individual steps within a job, which are tracked in a separate fact table, as shown below.
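
And a matching sketch for the step-level rows, differing only in the step_id filter and the target table:

WITH CTE_JOB_STEP_RUNS AS (
    SELECT
        sysjobhistory.job_id,
        sysjobhistory.step_id,
        CONVERT(DATETIME, CAST(sysjobhistory.run_date AS CHAR(8)), 112)
            + CAST(STUFF(STUFF(RIGHT('000000' + CAST(sysjobhistory.run_time AS VARCHAR(6)), 6),
                   5, 0, ':'), 3, 0, ':') AS DATETIME) AS Job_Step_Start_Datetime,
        (sysjobhistory.run_duration / 10000) * 3600
            + ((sysjobhistory.run_duration % 10000) / 100) * 60
            + (sysjobhistory.run_duration % 100) AS Job_Step_Duration_Seconds,
        CASE sysjobhistory.run_status
            WHEN 0 THEN 'Failure'
            WHEN 1 THEN 'Success'
            WHEN 2 THEN 'Retry'
            WHEN 3 THEN 'Canceled'
        END AS Job_Step_Status
    FROM msdb.dbo.sysjobhistory
    WHERE sysjobhistory.step_id > 0 -- Individual steps only; step 0 is the overall job.
)
INSERT INTO dbo.Fact_Job_Step_Run_Time
    (Sql_Agent_Job_Id, Job_Step_Id, Job_Step_Start_Datetime, Job_Step_Duration_Seconds, Job_Step_Status)
SELECT
    CTE_JOB_STEP_RUNS.job_id,
    CTE_JOB_STEP_RUNS.step_id,
    CTE_JOB_STEP_RUNS.Job_Step_Start_Datetime,
    CTE_JOB_STEP_RUNS.Job_Step_Duration_Seconds,
    CTE_JOB_STEP_RUNS.Job_Step_Status
FROM CTE_JOB_STEP_RUNS
WHERE CTE_JOB_STEP_RUNS.Job_Step_Start_Datetime >
    ISNULL((SELECT MAX(Job_Step_Start_Datetime) FROM dbo.Fact_Job_Step_Run_Time), '1/1/1900');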

This final segment of TSQL is almost identical to the previous job completion metrics collection script, except here we gather all rows in which the step_id is greater than zero. This intentionally omits overall job duration metrics and only includes individual step completion. Some notes on job step data:

  • Step names and details are not included. If this is something you’d like to collect & save, they are easy to add.
  • If a job fails and exits before all steps complete, then any subsequent steps will not appear in the results for that execution as the job never actually reached them in the step list. In other words, if a step never executes, it will not be represented in this data.
  • A job can succeed even if a step fails, depending on the logic used in the advanced job step configuration:

    The “On failure action” can be adjusted so that the job continues and ultimately reports success, assuming the underlying issue is resolved or deemed unimportant by this branching logic.

At this point, we have a complete script that can be executed anytime in order to collect job execution statistics. All of the underlying work can be broken into five distinct sections: one for each table. From here, we need to figure out what to do with this data in order to make the best use of it!

Cleanup

As with any data collection process, we want to clean up or archive old data to ensure that we are not storing data so old that it has become irrelevant and overly space-consuming. Only two tables contain data that can accumulate heavily over time: Fact_Job_Run_Time and Fact_Job_Step_Run_Time. Cleaning up data from these tables is as simple as choosing a cutoff period of some sort and enforcing it, either in the job that collects this data or in a more generalized process elsewhere on your server.

The following TSQL removes all job execution data older than a year:
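
A minimal sketch of such a cleanup, assuming the fact tables defined earlier:

DECLARE @Retention_Cutoff_Datetime DATETIME = DATEADD(YEAR, -1, GETDATE());

DELETE FROM dbo.Fact_Job_Run_Time
WHERE Job_Start_Datetime < @Retention_Cutoff_Datetime;

DELETE FROM dbo.Fact_Job_Step_Run_Time
WHERE Job_Step_Start_Datetime < @Retention_Cutoff_Datetime;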

Alternatively, we could limit data storage by the number of rows per job, per job step, or via any other convention that matches your business needs. If this data is to be crunched into more generic metrics per server, such as average duration per day or number of failures per month, then it may be possible to hang onto far less than a year of detail data.

It is important to distinguish between detail data, as stored in these tables, and reporting data, which can be gleaned from these tables or stored elsewhere as needed. We’ll discuss reporting later in this article in order to create some examples of what we can do with job performance metrics.

Customization

A process such as this begs for customization. Some of the tables shown here, such as the job-schedule pairings or job step runtimes, may or may not be important to you. There may be columns, such as job status, that you do not need. Alternatively, MSDB may have columns in its tables that I did not include but that may be useful to you.

The collection procedures and reporting scripts created here are very flexible—feel free to customize as you see fit. The general process is the same:

  1. Collect detailed job execution data.
  2. Store collected data in separate fact tables.
  3. Create further aggregation/number crunching as needed and store results in new tables.
  4. Generate reports off of the crunched data in order to satisfy trending/monitoring needs.

I have provided many ideas for customization throughout this article. If you find any creative or interesting uses for this data that are not outlined here, feel free to contact me and let me know! I enjoy seeing how processes such as this are changed over time to solve new, complex, or unexpected challenges.

Conclusion

Collecting, aggregating, and reporting on job performance metrics is not something we often consider when planning how to monitor our SQL Servers. Despite being less obvious than CPU, memory, or contention metrics, these stats can be critical in an environment that relies on SQL Server Agent jobs, letting us know when jobs run far too long and showing how performance trends over time.

Consider what metrics are needed in your environment, customize to get those details, and make the best use of the data collected. This sort of information allows us to identify and solve performance problems before they result in failures or timeouts, and can likewise avoid the emergencies and dreaded late-night wake-up calls that often follow.

Tracking-Job-Performance-in-SQL-Server_1-1.zip

About Ed Pollack

Ed has 20 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQL Saturday Albany, which has become an annual event for New York’s Capital Region. In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate.
