Prashanth Jayaram

SQL string functions for Data Munging (Wrangling)

September 13, 2018 by

In this article, you’ll learn the tips for getting started using SQL string functions for data munging with SQL Server. In many cases, Machine learning outcomes are only as good as the data they’re built on – but the work of preparing data for analytics (that is, data wrangling) can eat up as much as 80% of your project efforts. 

In this guide, we’ll see the following topics: 

  • What is data munging?
  • How you can reduce your data preparation time
  • How to easily get started with SQL string functions
  • How to process data using SQL string functions
  • And more…

We’ll look at specific SQL string function examples including

  • SQL concatenate string
  • SQL Server substring functions
  • SQL string functions
  • SQL Server convert string to date
  • SQL replace string
  • SQL convert INT to String
  • SQL convert String to DateTime
  • SQL string comparison
  • And more …

Data Munging

Data munging (or Wrangling) is the process of data transformation into various states so that it is easier to work and understand the data. The transformation may lead to manually convert or modify or merge the data in a certain format to generate well-defined streams of data which is ready for consumption by the data analysis tools and techniques.

The various data-sources are

  1. Metrics can be business data or KPI or collection of data within the sub-systems
  2. The use of existing data set for prediction
  3. Use of APIs to download the data
  4. Scraping the website data
  5. Creating data using third-party tools

You don’t have to be working in data science very long before you discover the importance of SQL. We can refer to the Kdnugget Software Poll, the top analytics, data mining, and data science software used in 2015 and look at the SQL’s place. In the survey of data professionals, SQL is placed third in terms of its usage. It’s also the first database tool on the list.  Now, you see that R is right at the top.

Note: The above picture is a reference from the following website. www.kdnuggets.com

Getting started

Let now take a deep dive into SQL string functions to see the different phases of data munging

SQL is further classified into Data Manipulation Language (DML) and Data Definition Language (DDL). These commands are used to work with data-sets more efficiently. We’ll take a look at DML commands at later part of the article.

Now, let’s discuss and analyze some of the SQL commands and understand why data-scientist needs to know these commands to do their work efficiently.  In most cases, a majority portion of their work is about data gathering, data preparation, data cleaning, and data restructuring. After the data preparation phase, the scientist can move forward with the data analysis. In some scenarios, it’s been assumed that about 70% to 80% of the time on the data science project is spent on data manipulation; if this is the case then most of that time is spent working with SQL queries.

The data cleansing is an art in data science; we often tend to collect data from multiple data sources. Many times the same data is stored differently in multiple systems. Let us classify the data munging process into the following categories:

  1. Data reformatting
  2. Data extracting
  3. Data filtering
  4. Data converting

Data refactoring

As a basic principle, whenever we start working with a new data set, it is recommended to spend more time to understand the type and nature of the data.

For example, one couple of data-sets may use abbreviations for departments and in some other datasets it may spell out the full name. We need to reformat data to get it into a consistent format. 

Character Description Example
ASCII The ASCII() SQL String function servers as a characters encoding standard format. In the following example, the ASCII values are returned for the given input
CHAR The CHAR() SQL string function converts an int ASCII code to a character value. Use CHAR to insert control characters into character strings. The control character are:
  • Tab is char(9)
  • Line feed is char(10)
  • Carriage return is char(13)
UPPER The UPPER() SQL string function is used to convert the lower case to upper case of the given string Convert the text ‘SQL Server 2017’ to upper-case
LOWER The LOWER() SQL string function is used to convert the upper case to lower case of the given string Convert the text ‘SQL Server 2017’ to lower-case Output: SQL Server 2017
CONCAT The CONCAT() SQL string function is used to concatenate two or more string  Concatenate strings ‘SQL Shack’ and ‘2018’ to SQL Shack 2018. Output: SQL Shack2018 2019 2020
DISTINCT The DISTINCT keyword is used to eliminates duplicate records from the SQL results
TRIM The TRIM() SQL string function is used to remove blanks from the leading and trailing position of the given string In the following example removes spaces from before and after the word SQL Server 2017. Output: SQL Server 2017
LTRIM LTRIM() SQL string function is used to remove the leading blanks from a given string In the following example removes the leading spaces from the word SQL Server 2017 Output: SQL Server 2017
RTRIM RTRIM() SQL string function is used to remove trailing blanks from a given string In the following example removes the trailing spaces from the word SQL Server 2017 Output: SQL Server 2017
RIGHT The RIGHT() SQL string function is used to return a specified number of characters from the right side of the given string The following example returns the 4 rightmost characters of the word SQL Server 2017. Output: 2017
LEFT The LEFT() SQL string function is used to return a specified number of characters from the left side of the given string The following example returns the 10 leftmost characters of the word SQL Server 2017. Output: SQL Server
REPLACE The REPLACE() SQL string function used to replace all the occurrences of a source string with a target string of a given string Replaces the string ’vNext’ with ‘2018’. Output: SQL Server 2017
REPLICATE The REPLICATE() SQL string function is used repeat the given string into the specified number of times The following example, the string ‘SQL Shack The string ‘SQL Shack Author’ is repeated 5 times The following example replicates a 0 character four times in front of a text
SPACE The SPACE() SQL string function replicates the number blanks that we want add to the given string The following example concatenates ‘SQL Shack’, two spaces and the word ‘2018’ Output: SQL Shack 2018
TRANSLATE The TRANSLATE() SQL string function is used to perform a one-to-one, single-character substitution of a given string In the following example the string replacement is performed using translate function. You can find more information here
REVERSE The REVERSE() SQL string function is used to get a mirror image of the given string The following example returns the word with the characters reversed. Output:
FORMAT The FORMAT() SQL string function is used to return specified formatted value The FORMAT SQL string function introduced in SQL SERVER 2012. It returns the value to format in by specified format and optional culture in SQL Server 2017. Examples for Date and Time formats Examples for Currency Example for percentage
CONCAT_WS The CONCAT_WS() SQL string function is a Concatenate with Separator and is a special form of CONCAT() Concat_WS() emulates the behavior of stuff and Coalesce function. In the following example, ‘-‘ is the delimiter specified in the first argument followed by firstname, Middle name and lastname. The output concatenates three columns from the Person table separating the value with a ‘-‘ Output: You can find more information here

Data Extracting

In addition to matching and reformatting strings, we sometimes need to take them apart and extract pieces of stings. SQL Server provides some general purpose SQL string functions for extracting and overriding strings. Let’s start with a simple string that’s easy to experiment

Character Description Usage
LEN The LEN() SQL string function is used to determine the length of the given string excluding the trailing blanks The following example selects the number of characters with an exclusion of the trailing spaces.
DATALENGTH The DATALENGTH() SQL string function excludes the trailing blanks in a given string. If this is a problem, then use DATALENGTH SQL string function which includes the trailing blanks. In this example, the trailing blanks are also considered while evaluating string length.
CHARINDEX The CHARINDEX() SQL string function is used to return the location of a substring in a given string. In the following example, the starting position ‘Shack’ of the first expression will be returned.
PATINDEX The PATINDEX() SQL string function is used to get the starting position of the first occurrence of given pattern in a specified expression In the following examples, ‘%’ and ‘_’ wildcard characters are used to fin the position of the pattern in a given expression. PATINDEX works just like LIKE operator but it returns the matching position.
SUBSTRING The SUBSTRING() SQL string function is used to returns a specific portion of a given string The following query returns only the part of an input character string. In this example, Example 1: In the below example, a part of given string “Prashanth Jayaram” is extracted dynamically using LEN and SQL Server SUBSTRING functions. Example 2: In the example, we can see the extraction of ObjectID from the given string
STUFF The STUFF() SQL string function is used to place a string within another string The following example returns a character string created by inserting a word Demo at the starting position 5 without deleting any letters from the given string ‘SQL Shack’ The following example returns a character string created by deleting 6 characters from the first string, SQL Shack, starting at position 5, at Shack, and inserting the second string ‘Demo’ at the deletion point.
STRING_AGG The STRING_AGG() SQL string function is the an aggregate function used to compute a single result from a set of input values The following example returns the names separated by ‘-‘ in a single result set. You can find more information here
STRING_SPLIT The STRING_SPLIT() SQL string function is used to splits the input string by a specified separation character and returns the output split values in the form of table SQL Server 2016 introduced a new STRING_SPLIT table-valued function. In an earlier version, we used to write function, CLR code to decode the values.

Data conversion

Sometimes we will need to reformat numbers. This is especially true when we use calculations that have results with large numbers of decimal digits

CONVERT The CONVERT() SQL string function is used to convert an expression from one data type to another data type. The CONVERT SQL string function accepts in a style parameter which is used for formatting the SQL output Implicit conversions do not require either the CAST function or the CONVERT function. Only explicit conversions require specification of the CAST or the CONVERT function. When converting a value from float or numeric to an integer, the CONVERT() SQL string function will truncate the result. For other conversions, the CONVERT() SQL string function will round the result. You can find more information here
CAST The CAST() SQL string function is used to convert an expression from one data type to another data type. Convert an expression from valid date string to DateTime. In the following example, the DateTime type is converted to another varchar type.
STR The STR() SQL string function converts a numeric value into a string The following example converts an integer to character string and concatenate the value with the first string.

Data filtering

The use of the WHERE and HAVING clauses in a SELECT statement control the subset of the output from the source tables.

  1. Compare search conditions using comparison operators
  2. Range search using between, and, and or clause
  3. List search using in operator and or clause

Regular expression search is based on patterns, wildcards, and special characters.  Even if you never ever write CLR, regex (as it’s also known) can be useful to you today, right now. Open a new query window in SQL Server Management Studio.

The regular expression transformation exposes the power of regular expression matching within the pipeline. One or more columns can be selected, and for each column an individual expression can be applied. The way multiple columns are handled can be set on the options page. The AND option means all columns must match, whilst the OR option means only one column has to match. If rows pass their tests then rows are passed down the successful match output. Rows that fail are directed down the alternate output

Character Description Usage
% Matches a string of zero or more characters. The following example returns the any values that matches the string ‘Kim’
Underscore (_) To Match a single character of given string The following example returns the any values whose first characters is unknown followed by ‘im’
[ …] Matches any character within a specified range The following example returns the values whose first characters is in the range A to S followed by ‘im’
[^…] Matches any character not within a specified range The following example returns the values whose first characters are unknown; second-and-third character is ‘im’ and fourth character that matches not within l to Z range.

Summary

This article is an effort to showcase the available SQL string functions to manipulate the raw data to make it more meaningful data-set for the data scientist to perform the data analysis using SQL Server. Hopefully these SQL string functions can save you both time and money!


Prashanth Jayaram

Prashanth Jayaram

I’m a Database technologist having 11+ years of rich, hands-on experience on Database technologies. I am Microsoft Certified Professional and backed with a Degree in Master of Computer Application.

My specialty lies in designing & implementing High availability solutions and cross-platform DB Migration. The technologies currently working on are SQL Server, PowerShell, Oracle and MongoDB.

View all posts by Prashanth Jayaram
Prashanth Jayaram

Latest posts by Prashanth Jayaram (see all)

SQL commands, String functions

About Prashanth Jayaram

I’m a Database technologist having 11+ years of rich, hands-on experience on Database technologies. I am Microsoft Certified Professional and backed with a Degree in Master of Computer Application. My specialty lies in designing & implementing High availability solutions and cross-platform DB Migration. The technologies currently working on are SQL Server, PowerShell, Oracle and MongoDB. View all posts by Prashanth Jayaram

418 Views