One of the important steps in an ETL process involves the transformation of source data. This could involve looking up foreign keys, converting values from one data type into another, or simply conducting data clean-ups by removing trailing and leading spaces. One aspect of transforming source data that could get complicated relates to the removal of ASCII special characters such as new line characters and the horizontal tab. In this article, we take a look at some of the issues you are likely to encounter when cleaning up source data that contains ASCII special characters and we also look at the user-defined function that could be applied to successfully remove such characters.
Replacing ASCII Printable Characters
The American Standard Code for Information Interchange (ASCII) is one of the generally accepted standardized numeric codes for representing character data in a computer. For instance, the ASCII numeric code associated with the backslash (\) character is 92. Many of the software vendors abide by ASCII and thus represents character codes according to the ASCII standard. Likewise, SQL Server, which uses ANSI – an improved version of ASCII, ships with a built-in CHAR function that can be used to convert an ASCII numerical code back to its original character code (or symbol). Script 1 shows us an example of how an ASCII numeric code 92 can be converted back into a backslash character as shown in Figure 1.
The backslash character falls into a category of ASCII characters that is known as ASCII Printable Characters – which basically refers to characters visible to the human eye. Table 1 shows a top 5 sample of ASCII Printable Characters.
When it comes to addressing data quality issues in SQL Server, it’s easy to clean most of the ASCII Printable Characters by simply applying the REPLACE function. Say for instance that source data contains an email address for John Doe that has several invalid special characters as shown in Script 2.
DECLARE @email VARCHAR(55) = 'johndoe@a!b#c.com$';
We could eliminate such characters by applying the REPLACE T-SQL function as shown in Script 3.
SELECT REPLACE(REPLACE(REPLACE(@email, '!', ''), '#', ''), '$', '');
Execution of Script 3 results into a correctly formatted email address that is shown in Figure 2.
Replacing ASCII Control Characters
In addition to ASCII Printable Characters, the ASCII standard further defines a list of special characters collectively known as ASCII Control Characters. Such characters typically are not easy to detect (to the human eye) and thus not easily replaceable using the REPLACE T-SQL function. Table 2 shows a sample list of the ASCII Control Characters.
|1||SOH||start of header|
|2||STX||start of text|
|3||ETX||end of text|
|4||EOT||end of transmission|
To demonstrate the challenge of cleaning up ASCII Control Characters, I have written a C# Console application shown in Script 4 that generates an output.txt text file that contains different variations of John Doe’s email address (only the first line has John Doe’s email address in the correct format).
using (StreamWriter writer = new StreamWriter(@"C:\temp\output.txt"))
string vd = "email@example.com";
writer.WriteLine((char)1 + vd);
writer.WriteLine((char)9 + vd + (char)1);
writer.WriteLine((char)9 + vd);
A preview of the output.txt text file populated by Script 4 is shown using the Windows Notepad.exe program in Figure 3.
As it can be seen, there seem to be spaces in email address 2-4 but it’s difficult to tell whether these spaces are created by the Tab character or the Space bar character. Furthermore, if you go back to Script 4, you will recall that for the 3rd email address, I included the start of header character at the end of the email address, but looking at the data in Figure 3, the start of header character is not easily visible at the end of that 3rd email address. In fact, it looks like the email address 3 and 4 have the same amount of characters – which is not true. Only using advanced text editors such as Notepad++ are we then able to visualize the special characters in the data, as shown in Figure 4.
When it comes to SQL Server, the cleaning and removal of ASCII Control Characters are a bit tricky. For instance, say we have successfully imported data from the output.txt text file into a SQL Server database table. If we were to run the REPLACE T-SQL function against the data as we did in Script 3, we can already see in Figure 5 that the REPLACE function was unsuccessful as the length of data in the original column is exactly similar to the length calculated after having applied both REPLACE and TRIM functions.
LEN([Column 0]) OriginalLength,
LEN(REPLACE(REPLACE(LTRIM(LTRIM([Column 0])), ' ', ''), ' ', '')) NewLength
FROM [SQLShack].[dbo].[OLE DB Destination];
So how do we replace what we cannot see?
Replace String using Character Codes
The simplest way to replace what we cannot see is that instead of hardcoding the string to replace into our REPLACE function, we should hardcode the string to be replaced by hardcoding its ASCII numerical code within the CHAR function. Thus, instead of providing an exclamation mark as the string to replace, we can hardcode the ASCII numerical code for exclamation mark – which is 33 and convert that numeric code back to character code using the CHAR function. Thus our script changes from:12DECLARE @email VARCHAR(55)= 'johndoe@a!bc.com';SELECT REPLACE(@email, '!', '');
To using:12DECLARE @email VARCHAR(55)= 'johndoe@a!bc.com';SELECT REPLACE(@email, CHAR(33), '');
Now going back to cleaning email address data out of the output.txt text file, we can rewrite our script to what is shown in Script 7.12345SELECT [id],[Column 0],LEN([Column 0]) OriginalLength,LEN(REPLACE(REPLACE([Column 0], CHAR(1), ''), CHAR(9), '')) NewLengthFROM [SQLShack].[dbo].[OLE DB Destination];
After executing Script 7, we can see in Figure 6 that the length of all email address rows matches back to the length of row 1 – which was originally the correct email address. Thus, we have successfully managed to remove “invincible” special characters.
Dynamically Detect and Replace ASCII Characters
One noticeable limitation of Script 7 is that we have hard-coded the list of ASCII numerical values. This means if the email address data contained special characters with ASCII numerical value 8 then we wouldn’t have removed them as we had hardcoded our script to specifically look for CHAR(1) and CHAR(9). Therefore, there is a need for a mechanism that allows us to automatically detect ASCII Control Characters contained in a given string and then automatically replace them. Script 8 provides such a mechanism in a form of a While loop within a user-defined function that iteratively searches through a given string to identify and replace ASCII Control Characters.123456789101112131415161718CREATE FUNCTION [dbo].[ReplaceASCII](@inputString VARCHAR(8000))RETURNS VARCHAR(55)ASBEGINDECLARE @badStrings VARCHAR(100);DECLARE @increment INT= 1;WHILE @increment <= DATALENGTH(@inputString)BEGINIF(ASCII(SUBSTRING(@inputString, @increment, 1)) < 33)BEGINSET @badStrings = CHAR(ASCII(SUBSTRING(@inputString, @increment, 1)));SET @inputString = REPLACE(@inputString, @badStrings, '');END;SET @increment = @increment + 1;END;RETURN @inputString;END;GO
The application of the function is shown in Script 9.12345SELECT [id],[Column 0],LEN([Column 0]) OriginalLength,LEN([SQLShack].[dbo].[ReplaceASCII]([Column 0])) NewLengthFROM [SQLShack].[dbo].[OLE DB Destination];
Every now and then T-SQL developers are faced with cleaning the data they have imported by usually applying the REPLACE T-SQL function. However, when it comes to removing special characters, removal of ASCII Control Characters can be tricky and frustrating. Fortunately, SQL Server ships with additional built-in functions such as CHAR and ASCII that can assist in automatically detecting and replacing ASCII Control Characters.
He is the member of the Johannesburg SQL User Group and also hold a Master’s Degree in MCom IT Management from the University of Johannesburg.
He currently works for Clientele Life as an Assistant Manager in Business Software Solutions.
View all posts by Sifiso W. Ndlovu