Seth Delconte

Parsing and rotating delimited data in SQL Server

December 9, 2015 by

Because databases aren’t always designed efficiently, data is sometimes stored in ways that make sorting and searching extremely difficult. Requests for meaningful, informative reports from such data, however, still must be fulfilled.

Note: This applies to SQL Server 2012 and higher.

SCENARIO

I have a preexisting table of 96 rows of company store locations and their respective phone contact numbers. There is one record for each store, each having a single ‘PhoneNumbers’ field that contains three separate phone numbers: the main store number with or without an extension, a secondary store number, and a mobile phone number. The phone data is very messy (all three numbers are crammed together into one field) and cannot easily be searched or sorted.


As you can see, the store numbers are delimited by either a slash (/) or an ampersand (&). The ‘main’ phone numbers are followed by one (or none) of the following text: ‘x’, ‘ext’, ‘extension’; an extension number; and then the text ‘ofc’, ‘main’, or ‘office’. The ‘secondary’ phone numbers are trailed by either the text ‘2nd’ or ‘secondary’; while the final ‘mobile’ numbers are followed by the text ‘cell’ or ‘cellphone’.

Our objective is to parse the three separate kinds of phone numbers, implement consistency, and return the parsed data so that each phone number is on a separate row (unpivoting).

PARSENAME

Our first step will be to parse the phone numbers on the delimiters. One method that neatly accomplishes this is the using the PARSENAME function. The PARSENAME function is logically designed to parse four-part object names – distinguishing between server name, database name, schema name, and object name:


The syntax to use for PARSENAME is:


The first parameter is the object to parse, and the second is the integer value of the object piece to return. The nice thing about PARSENAME is that it’s not limited to parsing just SQL Server four-part object names – it will parse any function or string data that is delimited by dots:



Let’s use PARSENAME to separate the three types of phone numbers. To simplify things, let’s use a bit of code with the REPLACE function to replace the delimiters with the expected ‘object part’ separator (dot):


We should now have the PhoneNumbers field values returned in this format: [main number].[secondary number].[mobile number]. Now we can use PARSENAME against the table’s delimiter-replaced data. PARSENAME will see the PhoneNumbers string as follows: [main number] = ‘object name’, [secondary number] = ‘schema name’, and [mobile number] = ‘database name’ – because of the order of the delimited data positions. We’ll use a Common Table Expression (CTE) called ‘replaceChars’ to run PARSENAME against the delimiter-replaced values. A CTE is useful for returning a temporary view or result set. We’ll return the original PhoneNumbers field as well, for comparison purposes:


Based on the replaced delimiter values, PARSENAME returns each section of the PhoneNumbers field by its position in the data string. Each of the three types of numbers now has its own field.

UNPIVOT

Our next task is to make sure each number for a given store is displayed on a separate record. Essentially, we want to convert some columns into rows. The UNPIVOT function was designed exactly for scenarios like this. Since we have three distinct types of values (Main, Secondary, and Mobile), we can designate them to be rotated and consolidated into a new field (‘Number’) and labeled by another new field (‘Precedence’) – all by using UNPIVOT:


Note that each store now has three records to accommodate the three types of phone number. The UNPIVOT function has allowed us to effectively rotate the three new phone number columns into rows – facilitating a more normalized table structure.

CLEANING THE DATA

Now that we’ve done the heavy lifting needed to parse and rotate the phone numbers, let’s spend some time cleaning up the inconsistencies in the data.

One improvement that can be made is to pull out the extension numbers that are associated with the Main numbers. We know that all extensions are referenced by either ‘x’, ‘ext’, or ‘extension’, so we can use the SUBSTRING and CHARINDEX functions to extract the single-digit extension numbers. We’ll generate a new field (‘Extension’) for these values in the return statement:


We’ve successfully extracted the extension numbers from the Main numbers into a new field. Remember, only Main numbers have an associated extension, so the others will be NULL.

Our next challenge is to do the following:

  1. Remove the intermittent dashes (-) from all numbers for standardization.
  2. Remove leading whitespaces.
  3. Separate the numbers from all trailing text.
  4. Prepend the four-digit Secondary numbers with the first six numbers from its respective store’s Main number – it can be assumed that each Secondary phone has the same area code and exchange as its associated Main number does.

For steps one and two, we’ll employ the REPLACE and LTRIM functions:


This combination removes leading whitespaces and eliminates dashes.

For step three, we’ll wrap the above solution in the SUBSTRING function:


For Secondary numbers, we’ll use a different ‘length’ argument value, since they have only four digits:


LAG

For our fourth step, we can use the LAG function (new for SQL Server 2012) – it allows referencing of previous records (exactly what we need to do to grab the first six characters from each store’s Main number):


We’ll append those results with the extracted Secondary number. Putting it all together, we have a complete solution. We can now also eliminate the original PhoneNumbers field:


ANALYSIS & FINAL STEPS

The Locations table’s data has now been parsed, rotated, cleaned, and standardized to the given requirements. It is now in a highly searchable and sortable format, and can be easily inserted to a new, permanent table that could replace the original.

Some remaining steps that could be tackled would be to

  • Normalize the data further – for example, if a ‘Store’ table exists, we could replace the Store field in the Locations data with one containing Store IDs, and establish referential integrity with a foreign key to the primary key of the Stores table.
  • Add appropriate indexes to the tables based on historical or anticipated queries.

See more

Consider these free tools for SQL Server that improve database developer productivity.

Seth Delconte
Latest posts by Seth Delconte (see all)
168 Views