Sifiso Ndlovu

SQL replace: How to replace ASCII special characters in SQL Server

August 7, 2017 by

One of the important steps in an ETL process involves the transformation of source data. This could involve looking up foreign keys, converting values from one data type into another, or simply conducting data clean-ups by removing trailing and leading spaces. One aspect of transforming source data that could get complicated relates to the removal of ASCII special characters such as new line characters and the horizontal tab. In this article, we take a look at some of the issues you are likely to encounter when cleaning up source data that contains ASCII special characters and we also look at the user-defined function that could be applied to successfully remove such characters.

Replacing ASCII Printable Characters

The American Standard Code for Information Interchange (ASCII) is one of the generally accepted standardized numeric codes for representing character data in a computer. For instance, the ASCII numeric code associated with the backslash (\) character is 92. Many of the software vendors abide by ASCII and thus represents character codes according to the ASCII standard. Likewise, SQL Server, which uses ANSI – an improved version of ASCII, ships with a built-in CHAR function that can be used to convert an ASCII numerical code back to its original character code (or symbol). Script 1 shows us an example of how an ASCII numeric code 92 can be converted back into a backslash character as shown in Figure 1.

Script 1

Figure 1

The backslash character falls into a category of ASCII characters that is known as ASCII Printable Characters – which basically refers to characters visible to the human eye. Table 1 shows a top 5 sample of ASCII Printable Characters.

Numeric Code Character Description
33 ! Exclamation Mark
35 # Number
36 $ Dollar
37 % Percent
38 & Ampersand

Table 1: ASCII Printable Characters (Source: RapidTables.com)

When it comes to addressing data quality issues in SQL Server, it’s easy to clean most of the ASCII Printable Characters by simply applying the REPLACE function. Say for instance that source data contains an email address for John Doe that has several invalid special characters as shown in Script 2.

Script 2

We could eliminate such characters by applying the REPLACE T-SQL function as shown in Script 3.

Script 3

Execution of Script 3 results into a correctly formatted email address that is shown in Figure 2.

Figure 2

Replacing ASCII Control Characters

In addition to ASCII Printable Characters, the ASCII standard further defines a list of special characters collectively known as ASCII Control Characters. Such characters typically are not easy to detect (to the human eye) and thus not easily replaceable using the REPLACE T-SQL function. Table 2 shows a sample list of the ASCII Control Characters.

Numeric Code Character Description
0 NUL null
1 SOH start of header
2 STX start of text
3 ETX end of text
4 EOT end of transmission

Table 2: Top 5 ASCII control characters (Source: RapidTables.com)

To demonstrate the challenge of cleaning up ASCII Control Characters, I have written a C# Console application shown in Script 4 that generates an output.txt text file that contains different variations of John Doe’s email address (only the first line has John Doe’s email address in the correct format).

Script 4

A preview of the output.txt text file populated by Script 4 is shown using the Windows Notepad.exe program in Figure 3.

Figure 3

As it can be seen, there seem to be spaces in email address 2-4 but it’s difficult to tell whether these spaces are created by the Tab character or the Space bar character. Furthermore, if you go back to Script 4, you will recall that for the 3rd email address, I included the start of header character at the end of the email address, but looking at the data in Figure 3, the start of header character is not easily visible at the end of that 3rd email address. In fact, it looks like the email address 3 and 4 have the same amount of characters – which is not true. Only using advanced text editors such as Notepad++ are we then able to visualize the special characters in the data, as shown in Figure 4.

Figure 4

When it comes to SQL Server, the cleaning and removal of ASCII Control Characters are a bit tricky. For instance, say we have successfully imported data from the output.txt text file into a SQL Server database table. If we were to run the REPLACE T-SQL function against the data as we did in Script 3, we can already see in Figure 5 that the REPLACE function was unsuccessful as the length of data in the original column is exactly similar to the length calculated after having applied both REPLACE and TRIM functions.

Script 5

Figure 5

So how do we replace what we cannot see?

  1. Replace String using Character Codes

    The simplest way to replace what we cannot see is that instead of hardcoding the string to replace into our REPLACE function, we should hardcode the string to be replaced by hardcoding its ASCII numerical code within the CHAR function. Thus, instead of providing an exclamation mark as the string to replace, we can hardcode the ASCII numerical code for exclamation mark – which is 33 and convert that numeric code back to character code using the CHAR function. Thus our script changes from:

    To using:

    Script 6

    Now going back to cleaning email address data out of the output.txt text file, we can rewrite our script to what is shown in Script 7.

    Script 7

    After executing Script 7, we can see in Figure 6 that the length of all email address rows matches back to the length of row 1 – which was originally the correct email address. Thus, we have successfully managed to remove “invincible” special characters.

    Figure 6

  2. Dynamically Detect and Replace ASCII Characters

    One noticeable limitation of Script 7 is that we have hard-coded the list of ASCII numerical values. This means if the email address data contained special characters with ASCII numerical value 8 then we wouldn’t have removed them as we had hardcoded our script to specifically look for CHAR(1) and CHAR(9). Therefore, there is a need for a mechanism that allows us to automatically detect ASCII Control Characters contained in a given string and then automatically replace them. Script 8 provides such a mechanism in a form of a While loop within a user-defined function that iteratively searches through a given string to identify and replace ASCII Control Characters.

    Script 8

    The application of the function is shown in Script 9.

    Script 9

Conclusion

Every now and then T-SQL developers are faced with cleaning the data they have imported by usually applying the REPLACE T-SQL function. However, when it comes to removing special characters, removal of ASCII Control Characters can be tricky and frustrating. Fortunately, SQL Server ships with additional built-in functions such as CHAR and ASCII that can assist in automatically detecting and replacing ASCII Control Characters.

References


Sifiso Ndlovu
168 Views