Mapping schema and recursively managing data – Part 1

November 18, 2015 by Ed Pollack

Introduction

In a typical OLTP environment, we want to maintain an acceptable level of data integrity.  The easiest way to do this is through the use of foreign keys, which ensure that the values for a given column will always match that of a primary key in another table.

Over time, as the number of tables, columns, and foreign keys increases, the structure of the database can become unwieldy.  A single table could easily link to thirty others, a table could have a parent-child relationship with itself, or a circular relationship could occur between a set of many tables.

A common request that comes up is to somehow report on, or modify, a set of data in a given table.  In a denormalized OLAP environment, this would be trivially easy, but in our OLTP scenario above, we could be dealing with many relationships, each of which needs to be considered prior to taking action.  As DBAs, we are committed to maintaining large amounts of data, but we need to ensure that our maintenance doesn’t break the applications that rely on that data.

How do we map out a database in such a way as to ensure that our work considers all relationships?  How can we quickly determine every row of data that relates to a given row?  That is the adventure we are embarking upon here!

Problem

It is possible to represent the table relationships in a database using an entity-relationship diagram (ERD), which shows each primary key & foreign key for the set of tables we are analyzing.  For this example, we will use AdventureWorks, focusing on the Production.Product table and relationships that can affect that table.  If we generate a complete ERD for AdventureWorks, we get a somewhat unwieldy result:

Not terribly pretty, but it’s a good overview that shows the “hot spots” in the database, where many relationships exist, as well as outliers, which have no dependencies defined.  One observation that becomes clear is that nearly every table is somehow related.  Five tables (at the top) stand alone, but otherwise every table has at least one relationship with another table.  Removing those five tables leaves 68 behind, which is small by many standards, but for visualizing relationships, is still rather clunky.  Generating an ERD on very large databases can yield what I fondly refer to as “Death Stars”, where there are hundreds or thousands of tables, and the diagram puts them in a huge set of concentric circles:

Whether it is a Spirograph or database is up to the viewer, but as a tool, it is more useful as wall art than as science.

To simplify our problem, let’s take a small segment of AdventureWorks that relates to the Product table:

This ERD illustrates 13 tables and their dependencies.  If we wanted to delete rows from Production.Product for any products that are silver, we would immediately need to consider all dependencies shown in that diagram.  To do this, we could manually write the following queries:
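The full set of queries is long, so only a couple are sketched here as an illustration; the exact queries in the original script may differ.  Each one counts how many rows in a dependent table trace back to silver products:

SELECT COUNT(*)
FROM Production.ProductInventory
INNER JOIN Production.Product
    ON Product.ProductID = ProductInventory.ProductID
WHERE Product.Color = 'Silver';

SELECT COUNT(*)
FROM Production.WorkOrderRouting
INNER JOIN Production.WorkOrder
    ON WorkOrder.WorkOrderID = WorkOrderRouting.WorkOrderID
INNER JOIN Production.Product
    ON Product.ProductID = WorkOrder.ProductID
WHERE Product.Color = 'Silver';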

While these queries are helpful, they took a very long time to write.  For a larger database, this exercise would take an even longer amount of time and, due to the tedious nature of the task, be very prone to human error.  In addition, order is critical—deleting from the wrong table in the hierarchy first could result in foreign key violations.  The row counts provided are total rows generated through the join statements, and are not necessarily the counts in any one table.  If we are ready to delete the data above, then we can convert those SELECT queries into DELETE statements, run them, and be happy with a job well done:
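As with the SELECTs above, only a few of the resulting DELETE statements are sketched here; the complete script would include one per dependent table, working from the most removed tables down to Production.Product itself:

DELETE WorkOrderRouting
FROM Production.WorkOrderRouting
INNER JOIN Production.WorkOrder
    ON WorkOrder.WorkOrderID = WorkOrderRouting.WorkOrderID
INNER JOIN Production.Product
    ON Product.ProductID = WorkOrder.ProductID
WHERE Product.Color = 'Silver';

DELETE WorkOrder
FROM Production.WorkOrder
INNER JOIN Production.Product
    ON Product.ProductID = WorkOrder.ProductID
WHERE Product.Color = 'Silver';

DELETE ProductInventory
FROM Production.ProductInventory
INNER JOIN Production.Product
    ON Product.ProductID = ProductInventory.ProductID
WHERE Product.Color = 'Silver';

-- ...remaining dependent tables omitted for brevity...

DELETE FROM Production.Product
WHERE Color = 'Silver';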

Unfortunately, the result of running this TSQL is an error:

The DELETE statement conflicted with the REFERENCE constraint “FK_SpecialOfferProduct_Product_ProductID”. The conflict occurred in database “AdventureWorks2012”, table “Sales.SpecialOfferProduct”, column ‘ProductID’.

It turns out there are relationships to tables outside of the Production schema in both Purchasing and Sales.  Using the full ERD above, we can add some additional statements to our delete script that will handle them:
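The additions follow the same pattern as before, covering the referencing tables in Sales and Purchasing ahead of the final delete from Production.Product.  An abbreviated sketch (the original script covers every referencing table):

DELETE SalesOrderDetail
FROM Sales.SalesOrderDetail
INNER JOIN Production.Product
    ON Product.ProductID = SalesOrderDetail.ProductID
WHERE Product.Color = 'Silver';

DELETE SpecialOfferProduct
FROM Sales.SpecialOfferProduct
INNER JOIN Production.Product
    ON Product.ProductID = SpecialOfferProduct.ProductID
WHERE Product.Color = 'Silver';

DELETE ProductVendor
FROM Purchasing.ProductVendor
INNER JOIN Production.Product
    ON Product.ProductID = ProductVendor.ProductID
WHERE Product.Color = 'Silver';

-- ...remaining Sales and Purchasing tables omitted, followed by the
-- Production schema deletes from before and, finally:
DELETE FROM Production.Product
WHERE Color = 'Silver';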

That executed successfully, but I feel quite exhausted from all the roundabout effort that went into the deletion of 43 rows from a table. Clearly this manual solution will not be scalable in any large database environment. What we need is a tool that can intelligently and quickly map these relationships for us.

Solution

We want to build a stored procedure that will take some inputs for the table we wish to act on, and any criteria we want to attach to it, and return actionable data on the structure of this schema.  In an effort to prevent this article from becoming unwieldy, I’ll refrain from a detailed explanation of every bit of SQL, and focus on overall function and utility.

Our first task is to define our stored procedure, build parameters, and gather some basic data about the table we wish to act on (called the “target table” going forward).  Deletion will be the sample action as it is the most destructive example that we can use.  We will build our solution with 3 basic parameters:

@schema_name: The name of the schema we wish to report on
@table_name: The name of the table we wish to report on (target table).
@where_clause: The filter that we will apply when analyzing our data.
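A minimal sketch of this first step is below.  The procedure name and internal details are illustrative assumptions rather than the exact code; the key pieces are the three parameters and the early exit when the filter matches no rows:

CREATE PROCEDURE dbo.map_table_relationships
    @schema_name SYSNAME,
    @table_name SYSNAME,
    @where_clause NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @sql_command NVARCHAR(MAX);
    DECLARE @row_count INT;

    -- Count the rows in the target table that match the provided filter.
    SELECT @sql_command = N'SELECT @row_count = COUNT(*) FROM ' +
        QUOTENAME(@schema_name) + N'.' + QUOTENAME(@table_name) +
        N' WHERE ' + @where_clause + N';';
    EXEC sp_executesql @sql_command, N'@row_count INT OUTPUT', @row_count = @row_count OUTPUT;

    -- If nothing matches, exit immediately with an informational message.
    IF @row_count = 0
    BEGIN
        SELECT 'There are no rows to process based on the input table and where clause.  Execution aborted.' AS notification;
        RETURN;
    END;

    -- The schema mapping work continues here in later steps...
END
GO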

For step one, we have also added a row count check.  In the event that the filter we apply to the target table returns no rows, we’ll exit immediately and provide an informational message to let the user know that no further work is needed.  As a test of this, we can execute the following SQL, using a color that is surely not found in AdventureWorks:
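Assuming the procedure name from the sketch above (the color used in the original test may differ; any color not present in AdventureWorks will do):

EXEC dbo.map_table_relationships
    @schema_name = 'Production',
    @table_name = 'Product',
    @where_clause = 'Color = ''Periwinkle''';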

The result is exactly as we expected:

There are no rows to process based on the input table and where clause.  Execution aborted.

There is no other output or action from the stored proc so far, but this provides a framework to begin our work.

The first hurdle to overcome is collecting data on our schema and organizing it in a meaningful fashion.  To process table data effectively, we need to turn an ERD into rows of metadata that describe a specific relationship, as well as how it relates to our target table.  A critical point to emphasize is that we are not just interested in relationships between tables.  A set of relationships is not enough to completely map all data paths within a database.  What we are truly interested in are data paths: each set of relationships that leads from a given column back to our target table.

A table can be related to another via many different sets of paths, and it is important that we define all of these paths, so as not to miss any important relationships.  The following shows a single example of two tables that are related in multiple ways:

If we wanted to delete from the account table, we would need to examine the following relationships:

account_contract -> account (via account_id)
account_contract -> employee_resource (via contract_owner_resource_id)
account -> account_resource (via account_primary_resource_id)
account_contract -> employee_resource (via account_id and account_primary_resource_id)

The last relationship is very important—it illustrates a simple example of how it is possible for two tables to relate through any number of paths in between.  It’s even possible for two tables to relate through the same intermediary tables, but using different key columns.  Either way, we must consider all of these relationships in our work.

In order to map these relationships, we will need to gather the appropriate schema metadata from a variety of system views and recursively relate that data back to itself, building a useful set of data to move forward with:
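A sketch of that recursive CTE is shown below.  The variable and alias names are assumptions made for illustration, and the actual procedure inserts this output into a @foreign_keys table variable (with foreign_key_id as an identity primary key), but the overall shape is the technique being described: anchor on the target table, then recurse through sys.foreign_key_columns.

DECLARE @schema_name SYSNAME = 'Production';
DECLARE @table_name SYSNAME = 'Product';
DECLARE @target_object_id INT =
    OBJECT_ID(QUOTENAME(@schema_name) + '.' + QUOTENAME(@table_name));

WITH fk_relationships AS
(
    -- Anchor: foreign keys that reference the target table directly (level 1).
    SELECT
        FKC.parent_object_id AS referencing_object_id,
        RS.name AS referencing_schema_name,
        RT.name AS referencing_table_name,
        RC.name AS referencing_column_name,
        FKC.referenced_object_id AS primary_key_object_id,
        PS.name AS primary_key_schema_name,
        PT.name AS primary_key_table_name,
        PC.name AS primary_key_column_name,
        1 AS [level],
        CAST(FKC.referenced_object_id AS VARCHAR(MAX)) + '-' +
            CAST(FKC.parent_object_id AS VARCHAR(MAX)) AS object_id_hierarchy_rank,
        CAST(RC.name AS VARCHAR(MAX)) AS referencing_column_name_rank
    FROM sys.foreign_key_columns FKC
    INNER JOIN sys.tables RT ON RT.object_id = FKC.parent_object_id
    INNER JOIN sys.schemas RS ON RS.schema_id = RT.schema_id
    INNER JOIN sys.columns RC ON RC.object_id = FKC.parent_object_id
        AND RC.column_id = FKC.parent_column_id
    INNER JOIN sys.tables PT ON PT.object_id = FKC.referenced_object_id
    INNER JOIN sys.schemas PS ON PS.schema_id = PT.schema_id
    INNER JOIN sys.columns PC ON PC.object_id = FKC.referenced_object_id
        AND PC.column_id = FKC.referenced_column_id
    WHERE FKC.referenced_object_id = @target_object_id
        AND FKC.parent_object_id <> FKC.referenced_object_id -- Self-referencing keys are handled separately.

    UNION ALL

    -- Recursive member: foreign keys that reference a table already in the tree.
    SELECT
        FKC.parent_object_id,
        RS.name,
        RT.name,
        RC.name,
        FKC.referenced_object_id,
        PS.name,
        PT.name,
        PC.name,
        FKR.[level] + 1,
        FKR.object_id_hierarchy_rank + '-' + CAST(FKC.parent_object_id AS VARCHAR(MAX)),
        FKR.referencing_column_name_rank + '-' + CAST(RC.name AS VARCHAR(MAX))
    FROM fk_relationships FKR
    INNER JOIN sys.foreign_key_columns FKC ON FKC.referenced_object_id = FKR.referencing_object_id
    INNER JOIN sys.tables RT ON RT.object_id = FKC.parent_object_id
    INNER JOIN sys.schemas RS ON RS.schema_id = RT.schema_id
    INNER JOIN sys.columns RC ON RC.object_id = FKC.parent_object_id
        AND RC.column_id = FKC.parent_column_id
    INNER JOIN sys.tables PT ON PT.object_id = FKC.referenced_object_id
    INNER JOIN sys.schemas PS ON PS.schema_id = PT.schema_id
    INNER JOIN sys.columns PC ON PC.object_id = FKC.referenced_object_id
        AND PC.column_id = FKC.referenced_column_id
    WHERE FKC.parent_object_id <> FKC.referenced_object_id -- Skip self-referencing keys.
        AND '-' + FKR.object_id_hierarchy_rank + '-' NOT LIKE
            '%-' + CAST(FKC.parent_object_id AS VARCHAR(MAX)) + '-%' -- Do not revisit a table already on this path.
)
SELECT
    ROW_NUMBER() OVER (ORDER BY [level], referencing_schema_name, referencing_table_name) AS foreign_key_id,
    *
FROM fk_relationships;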

The TSQL above builds a set of data, centered on the target table provided (in the anchor section of the CTE), and recursively maps each level of relationships via each table’s foreign keys. The result set includes the following columns:

foreign_key_id: An auto-numbering primary key.
referencing_object_id: The object_id of the referencing table
referencing_schema_name: The name of the referencing schema
referencing_table_name: The name of the referencing table
referencing_column_name: The name of the specific referencing column for the referencing table above
primary_key_object_id: The object_id of the table referenced by the referencing table above
primary_key_schema_name: The schema name of the primary key table.
primary_key_table_name: The table name of the primary key table.
primary_key_column_name: The name of the primary key column referenced by the referencing column.
level: How many steps does this relationship path trace from the target table to the referencing table?  This provides us the ability to logically order any operations from most removed to least removed.  For delete or update statements, this is crucial.
object_id_hierarchy_rank: A list of each table’s object_id within the relationship tree.  The target table is on the left, whereas the referencing table for each relationship is on the right. This will be used when constructing TSQL statements and optimizing away unneeded TSQL.
referencing_column_name_rank: A list of the names of the referencing columns.  This will be used later on for optimizing and removing irrelevant statements.

There are 2 WHERE clauses that are worth explaining further:
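The first, shown here as it appears in the sketch above (the exact expression in the original procedure may differ), compares each candidate table against the path built so far:

AND '-' + FKR.object_id_hierarchy_rank + '-' NOT LIKE
    '%-' + CAST(FKC.parent_object_id AS VARCHAR(MAX)) + '-%'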

This ensures that we don’t loop around in circles forever.  If a relationship exists that is circular (such as our account example earlier), then an unchecked recursive CTE would continue to increment the level and add to the relationship tree until the recursion limit was reached.  We want to enumerate each relationship path only once, and this guards against infinite loops and repeated data.
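The second, again taken from the sketch above, excludes any key whose referencing and referenced object are the same table:

AND FKC.parent_object_id <> FKC.referenced_object_id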

There is a single caveat that was explicitly avoided above: self-referencing foreign keys.  In an effort to avoid infinite loops, we remove any foreign keys that reference their own table.  If the referencing and referenced tables are the same, then we will filter them out of our result set immediately and deal with them separately.

We’ve explicitly excluded relationships from a table to itself, and are now obligated to do something about that.  To collect this data, we do not need a recursive CTE.  A set of joins between parent & child data will suffice:
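A sketch of that collection step follows; the @self_referencing_keys table variable is a name invented here for illustration:

DECLARE @self_referencing_keys TABLE
(   referencing_schema_name SYSNAME,
    referencing_table_name SYSNAME,
    referencing_column_name SYSNAME,
    primary_key_column_name SYSNAME );

INSERT INTO @self_referencing_keys
    (referencing_schema_name, referencing_table_name, referencing_column_name, primary_key_column_name)
SELECT
    S.name AS referencing_schema_name,
    T.name AS referencing_table_name,
    RC.name AS referencing_column_name,
    PC.name AS primary_key_column_name
FROM sys.foreign_key_columns FKC
INNER JOIN sys.tables T ON T.object_id = FKC.parent_object_id
INNER JOIN sys.schemas S ON S.schema_id = T.schema_id
INNER JOIN sys.columns RC ON RC.object_id = FKC.parent_object_id
    AND RC.column_id = FKC.parent_column_id
INNER JOIN sys.columns PC ON PC.object_id = FKC.referenced_object_id
    AND PC.column_id = FKC.referenced_column_id
WHERE FKC.parent_object_id = FKC.referenced_object_id; -- The table references itself.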

We can return data from this table (if needed) with one additional query:
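Using the table variable from the sketch above:

SELECT *
FROM @self_referencing_keys
ORDER BY referencing_schema_name, referencing_table_name;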

We now have all of the data needed in order to begin analysis.  We have a total of 3 goals to achieve here:

1. Get counts of data that fit each relationship.
2. If there are zero rows found for any relationship, then we can disregard it for the sake of deleting data. This will greatly improve execution speed and efficiency on larger databases.
3. Generate DELETE statements for the relevant data identified above.

Collecting row counts will require dynamic SQL in order to query an unknown list of tables and columns. For our example here, I use SELECT COUNT(*) queries to return row counts. If you are working with tables that have significant row counts, then you may find this approach to be slow, so please do not run the research portion of this stored procedure in a production environment without some level of caution (using the READ UNCOMMITTED isolation level removes contention, though it won’t speed things up much).

The following TSQL defines some new variables and iterates through each relationship until row counts have been collected for all of them:
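A simplified sketch of that loop is below.  It assumes the @foreign_keys structure from earlier, including the three new columns described just after this sketch, and for brevity only builds the join back to the target table for direct (level 1) relationships; the full procedure chains INNER JOINs along the entire relationship path and caches them in join_condition_sql:

-- In the real procedure this table variable is declared once and populated by
-- the recursive CTE shown earlier; it is redeclared here to keep the sketch
-- self-contained.
DECLARE @foreign_keys TABLE
(   foreign_key_id INT IDENTITY(1,1) PRIMARY KEY,
    referencing_schema_name SYSNAME,
    referencing_table_name SYSNAME,
    referencing_column_name SYSNAME,
    primary_key_schema_name SYSNAME,
    primary_key_table_name SYSNAME,
    primary_key_column_name SYSNAME,
    [level] INT,
    processed BIT DEFAULT 0,
    [row_count] INT,
    join_condition_sql NVARCHAR(MAX) );

DECLARE @where_clause NVARCHAR(MAX) = 'Color = ''Silver'''; -- Passed in as a parameter in the real procedure.
DECLARE @sql_command NVARCHAR(MAX);
DECLARE @join_condition_sql NVARCHAR(MAX);
DECLARE @current_foreign_key_id INT;
DECLARE @row_count INT;

-- Simplification: only direct (level 1) relationships are processed here.
WHILE EXISTS (SELECT 1 FROM @foreign_keys WHERE processed = 0 AND [level] = 1)
BEGIN
    -- Collect the details of the next unprocessed relationship and build its count query.
    SELECT TOP 1
        @current_foreign_key_id = foreign_key_id,
        @join_condition_sql =
            N' INNER JOIN ' + QUOTENAME(primary_key_schema_name) + N'.' + QUOTENAME(primary_key_table_name) + N' target_table' +
            N' ON target_table.' + QUOTENAME(primary_key_column_name) +
            N' = referencing_table.' + QUOTENAME(referencing_column_name),
        @sql_command =
            N'SELECT @row_count = COUNT(*) FROM ' +
            QUOTENAME(referencing_schema_name) + N'.' + QUOTENAME(referencing_table_name) + N' referencing_table'
    FROM @foreign_keys
    WHERE processed = 0 AND [level] = 1;

    SET @sql_command = @sql_command + @join_condition_sql + N' WHERE ' + @where_clause + N';';

    -- Execute the count and capture the result.
    EXEC sp_executesql @sql_command, N'@row_count INT OUTPUT', @row_count = @row_count OUTPUT;

    -- Cache the results so this work never needs to be repeated.
    UPDATE @foreign_keys
    SET processed = 1,
        [row_count] = @row_count,
        join_condition_sql = @join_condition_sql
    WHERE foreign_key_id = @current_foreign_key_id;
END;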

3 new columns have been added to our @foreign_keys table:

processed: A bit used to flag a relationship once it has been analyzed.
row_count: The row count that results from our work above.
join_condition_sql: The sequence of INNER JOIN statements generated above is cached here so that we do not need to perform all of this work again in the future.

The basic process followed is to:

1. Collect all relevant information about a single foreign key relationship.
2. Build all of the INNER JOINs that relate this foreign key back to the target table via the specific relationship defined in step 1.
3. Execute the count TSQL.
4. Store the output of the count TSQL in our @foreign_keys table for use later.

Conclusion (Until Part 2)

We’ve built a framework for traversing a hierarchy of foreign keys, and are well on our way towards our goal of effective schema research. In Part 2, we’ll apply some optimization to our stored procedure in order to speed up execution on larger, more complex databases. We’ll then put all the pieces together and demo the result of all of this work. Thanks for reading, and I hope you’re enjoying this adventure so far!



About Ed Pollack

Ed has 20 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQL Saturday Albany, which has become an annual event for New York’s Capital Region. In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate.
