Ed Pollack

Mapping schema and recursively managing data – Part 2

November 20, 2015 by Ed Pollack

Introduction

In Part 1 of this article, we built a case for the benefits of documenting our schema via the use of a stored procedure, as well as the ways this information can be used for DML operations on a relational database. Below, we will continue where we left off by speeding up our SQL queries and then putting everything together so that we can demo a usable script that will hopefully make your life easier!

Optimization

One additional step that has not been addressed yet, but should be, is optimization.  As we collect row counts, we will undoubtedly find many relationships with row counts of zero.  For the sake of deleting data, these relationships may be removed from our foreign key data set.  But wait—there’s more!  Any relationship that is a child of a parent with zero rows can also be deleted.  This is an immense optimization step, as we can slice off large swaths of data very quickly.  Even more importantly, each future iteration of our WHILE loop won’t have to touch that newly defined chunk of irrelevant relationship data.  In tests I conducted on large databases, these additional steps reduced runtime by as much as 95%.

The following change to our TSQL from above illustrates this optimization procedure:
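Since the original listing is not reproduced here, the following is a minimal sketch of that change. The variable and column names (`@foreign_keys`, `row_count`, `object_id_hierarchy_rank`, and so on) are assumptions consistent with the narrative, not the article's exact code:

```sql
-- Sketch only: identifiers below are assumptions, not the article's exact code.
IF @row_count = 0
BEGIN
    -- No rows to process: remove this relationship AND every child of it.
    -- Children share the parent's object_id hierarchy string as a prefix,
    -- so a single LIKE comparison slices off the whole branch.
    DELETE fk
    FROM @foreign_keys fk
    WHERE fk.object_id_hierarchy_rank LIKE @current_hierarchy_rank + '%';
END
ELSE
BEGIN
    -- Rows exist: record the count so later iterations can use it.
    UPDATE @foreign_keys
    SET row_count = @row_count
    WHERE foreign_key_id = @current_foreign_key_id;
END
```

The prefix match is what makes this cheap: one string comparison prunes an entire subtree of relationships in a single statement.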

By adding in a row count check, we choose between two scenarios:

  1. There are no rows of data to process: delete this foreign key relationship, as well as all relationships that are children of it.
  2. There are rows of data for this relationship: update @foreign_keys with the necessary information.

The generation of the INNER JOIN data that is stored in #inner_join_tables is a bit complex, and is written as it is to ensure that we never join into the same relationship twice or incorrectly.  We want to trace the correct path from the current relationship back to the target table through a sequence of unique column relationships.  The additional WHERE clauses guard against a handful of important scenarios:

  1. If a relationship could theoretically loop back through a table we have already referenced, prevent further processing.  Otherwise, we would allow infinite loops to occur.
  2. Do not process the target table in the loop.  Handle it separately as it is a special case.
  3. If multiple relationship paths exist between two tables, ensure that only the current path is traversed.  The @has_same_object_id_hierarchy variable checks for identical table paths and allows for extra logic to be included when this happens.
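The three guards above can be sketched as WHERE-clause predicates. This is illustrative only; the variable and column names are assumptions based on the output columns shown later in the article:

```sql
-- Illustrative predicates; identifiers are assumptions, not verbatim code.
SELECT fk.foreign_key_id
FROM @foreign_keys fk
WHERE
    -- 1. Never revisit a table already in the current path; the hierarchy
    --    string records every object_id traversed, preventing infinite loops.
    CHARINDEX(CAST(fk.referencing_object_id AS VARCHAR(20)),
              @object_id_hierarchy_rank) = 0
    -- 2. Skip the target table here; it is handled separately as a special case.
AND fk.referencing_object_id <> @target_object_id
    -- 3. When multiple paths connect the same pair of tables, follow only the
    --    current column path (the @has_same_object_id_hierarchy check).
AND (@has_same_object_id_hierarchy = 0
     OR fk.referencing_column_hierarchy_rank = @column_hierarchy_rank);
```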

At this point, we can take all of the data gathered above and use it to generate a list of DELETE statements, in order, for the target table and WHERE clause provided in the stored procedure parameters.

Comments are added into the PRINT statements with the maximum number of rows to be deleted.  This is based on join data and may be rendered inaccurate as execution occurs, as a row of data may belong to several relationships, and once deleted will not be available for deletion as part of any others.  These estimates are useful, though, in gauging the volume of data that each relationship represents.

The bulk of the work in our stored procedure is done when the counts are calculated.  The deletion section iterates through a similar loop, printing out the appropriate DELETE TSQL, in order, for each relationship that was previously defined and enumerated.

There is one final task to manage, and that is self-referencing relationships.  If a table has a parent-child relationship with itself, we explicitly avoided it above.  How to properly handle these relationships, should they exist, is up to you.  Whether we cascade DELETE statements through the rest of our work or simply set the foreign key column to NULL would be based on the appropriate business logic.  In the following TSQL, we set any of these relationships to NULL that happen to directly relate to the target table.  We could also tie it into our big loop and traverse all relationships previously defined, but I have left this edge case out as it is not too common:
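As a sketch of that NULL-out approach, assuming the self-referencing foreign key HumanResources.Employee.ManagerID → HumanResources.Employee.EmployeeID from the original AdventureWorks sample database (the Title filter stands in for whatever WHERE clause was passed to the procedure):

```sql
-- Sketch: clear a self-referencing foreign key column for rows that
-- reference the target rows. The filter below is a placeholder for the
-- procedure's WHERE clause parameter.
UPDATE HumanResources.Employee
SET ManagerID = NULL
WHERE ManagerID IN
    (SELECT EmployeeID
     FROM HumanResources.Employee
     WHERE Title = 'Chief Executive Officer');
```

Whether to NULL the column, cascade the DELETE, or leave the rows untouched is a business-logic decision, which is why the procedure only prints the statement rather than executing it.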

This logic is simplified, but at least identifies relationships where data exists, and provides some sample TSQL that could be used to clear them out in their entirety, if that is the best approach for your data.

Putting Everything Together

With all of our objectives completed, we can now piece together our stored procedure in its final form.  The relationship data is selected back to SSMS at the end as a reference, so you can easily view & save this data as needed:

We can run some test scenarios on AdventureWorks:
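A call along these lines would produce the first scenario described below. The procedure and parameter names here are assumptions consistent with the narrative, not the article's exact signature:

```sql
-- Hypothetical call: procedure and parameter names are assumptions.
-- Targets Production.Product, traversing up to 2 levels of relationships.
EXEC dbo.delete_rows_recursively
    @schema_name  = 'Production',
    @table_name   = 'Product',
    @where_clause = 'WHERE ProductID = 712',
    @max_levels   = 2;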

This will return 14 relationships that stem from Production.Product, up to 2 levels away from that table.  The DELETE statements are printed to the text window and are directly tied to each row of relationship output from above. 

Another example can be run that shows an additional level of relationship abstraction, and the 15 relationships that exist as part of it:
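Based on the sample output row below (whose hierarchy begins at Production.ProductModel and reaches level 3), the second scenario would look something like this. Again, names and the filter value are assumptions:

```sql
-- Hypothetical call: one more level of relationship depth, starting
-- from Production.ProductModel instead of Production.Product.
EXEC dbo.delete_rows_recursively
    @schema_name  = 'Production',
    @table_name   = 'ProductModel',
    @where_clause = 'WHERE ProductModelID = 10',
    @max_levels   = 3;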

We can look at one relationship from this example to see what data was output and why:

foreign_key_id: 20
referencing_object_id: 1154103152

referencing_schema_name: Sales
referencing_table_name: SalesOrderDetail
referencing_column_name: SpecialOfferID
primary_key_object_id: 414624520
primary_key_schema_name: Sales
primary_key_table_name: SpecialOfferProduct
primary_key_column_name: SpecialOfferID
level: 3
object_id_hierarchy_rank: 418100530-1973582069-414624520-1154103152
referencing_column_hierarchy_rank: ProductModelID-ProductID-SpecialOfferID

This charts the foreign key relationships between Sales.SalesOrderDetail, Sales.SpecialOfferProduct, Production.Product, and Production.ProductModel, as is shown in the following ERD:

The resulting DELETE statement for this relationship is as follows:
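The generated statement is not reproduced here, but following the column hierarchy above (ProductModelID → ProductID → SpecialOfferID), it would approximate the following. The filter value is a placeholder for the WHERE clause passed to the procedure:

```sql
-- Approximation of the generated statement; the filter value is illustrative.
-- Joins walk back from the referencing table to the target, one foreign key
-- column per hop, per the referencing_column_hierarchy_rank shown above.
DELETE SalesOrderDetail
FROM Sales.SalesOrderDetail
INNER JOIN Sales.SpecialOfferProduct
    ON SpecialOfferProduct.SpecialOfferID = SalesOrderDetail.SpecialOfferID
INNER JOIN Production.Product
    ON Product.ProductID = SpecialOfferProduct.ProductID
INNER JOIN Production.ProductModel
    ON ProductModel.ProductModelID = Product.ProductModelID
WHERE ProductModel.ProductModelID = 10;  -- placeholder target filter
```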

Conclusion

The ability to quickly map out a complete relationship hierarchy can be a huge time-saving tool to have at one’s disposal.  The stored procedure presented here is intended to be a base research script.  You, the user, can customize it to your heart’s content, adding additional parameters, print options, or unique logic that is exclusive to your business.  Deletion was the target usage, but a similar process could be used in any application where complete hierarchical knowledge of a database (or part of a database) is needed.  This stored procedure can be used as-is with no modifications, and will still provide quite a bit of information on your database schema.

If you are working in a relational environment that lacks foreign keys, you can still utilize this approach, but would have to define those relationships in a table, file, or TSQL statement.  Once available, the stored procedure could be modified slightly to take that data as an input, format it as the system views would, and continue as though it were the same data.

Leaving off the WHERE clause returns a complete relationship tree, for cases where you want to touch all data in a target table.  If your interest is a database map with no omissions, you could also drop the zero-row-relationship DELETE that was added for efficiency.  The resulting stored procedure will take more time to execute, but will provide all possible relationships from a given entity, which could be useful when researching code changes or data integrity.

Regardless of usage, knowledge is power, and the ability to obtain large amounts of schema-dependent information quickly & efficiently can turn complex tasks into trivial bits of work.  The ability to customize those processes allows for nearly limitless applications and the ability to conduct research and reconnaissance that would otherwise be extremely labor-intensive, error-prone, or seemingly impossible!



Ed Pollack

Ed has 15 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQL Saturday Albany, which has become an annual event for New York’s Capital Region.

In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate.
