In this article, you’ll learn the tips for getting started using SQL string functions for data munging with SQL Server. In many cases, Machine learning outcomes are only as good as the data they’re built on – but the work of preparing data for analytics (that is, data wrangling) can eat up as much as 80% of your project efforts.
In this guide, we’ll see the following topics:
What is data munging?
How you can reduce your data preparation time
How to easily get started with SQL string functions
How to process data using SQL string functions
We’ll look at specific SQL string function examples including
SQL concatenate string
SQL Server substring functions
SQL string functions
SQL Server convert string to date
SQL replace string
SQL convert INT to String
SQL convert String to DateTime
SQL string comparison
And more …
Data munging (or Wrangling) is the process of data transformation into various states so that it is easier to work and understand the data. The transformation may lead to manually convert or modify or merge the data in a certain format to generate well-defined streams of data which is ready for consumption by the data analysis tools and techniques.
The various data-sources are
Metrics can be business data or KPI or collection of data within the sub-systems
The use of existing data set for prediction
Use of APIs to download the data
Scraping the website data
Creating data using third-party tools
You don’t have to be working in data science very long before you discover the importance of SQL. We can refer to the Kdnugget Software Poll, the top analytics, data mining, and data science software used in 2015 and look at the SQL’s place. In the survey of data professionals, SQL is placed third in terms of its usage. It’s also the first database tool on the list. Now, you see that R is right at the top.
Let now take a deep dive into SQL string functions to see the different phases of data munging
SQL is further classified into Data Manipulation Language (DML) and Data Definition Language (DDL). These commands are used to work with data-sets more efficiently. We’ll take a look at DML commands at later part of the article.
Now, let’s discuss and analyze some of the SQL commands and understand why data-scientist needs to know these commands to do their work efficiently. In most cases, a majority portion of their work is about data gathering, data preparation, data cleaning, and data restructuring. After the data preparation phase, the scientist can move forward with the data analysis. In some scenarios, it’s been assumed that about 70% to 80% of the time on the data science project is spent on data manipulation; if this is the case then most of that time is spent working with SQL queries.
The data cleansing is an art in data science; we often tend to collect data from multiple data sources. Many times the same data is stored differently in multiple systems. Let us classify the data munging process into the following categories:
As a basic principle, whenever we start working with a new data set, it is recommended to spend more time to understand the type and nature of the data.
For example, one couple of data-sets may use abbreviations for departments and in some other datasets it may spell out the full name. We need to reformat data to get it into a consistent format.
The ASCII() SQL String function servers as a characters encoding standard format.
In the following example, the ASCII values are returned for the given input
The CONCAT_WS() SQL string function is a Concatenate with Separator and is a special form of CONCAT()
Concat_WS() emulates the behavior of stuff and Coalesce function. In the following example, ‘-‘ is the delimiter specified in the first argument followed by firstname, Middle name and lastname. The output concatenates three columns from the Person table separating the value with a ‘-‘
In addition to matching and reformatting strings, we sometimes need to take them apart and extract pieces of stings. SQL Server provides some general purpose SQL string functions for extracting and overriding strings. Let’s start with a simple string that’s easy to experiment
The LEN() SQL string function is used to determine the length of the given string excluding the trailing blanks
The following example selects the number of characters with an exclusion of the trailing spaces.
SELECTLEN('SQL Server 2017')
SELECTLEN('SQL Server 2017 ')
The DATALENGTH() SQL string function excludes the trailing blanks in a given string. If this is a problem, then use DATALENGTH SQL string function which includes the trailing blanks.
In this example, the trailing blanks are also considered while evaluating string length.
selectDATALENGTH('SQL Server 2017')
selectDATALENGTH('SQL Server 2017 ')
The CHARINDEX() SQL string function is used to return the location of a substring in a given string.
In the following example, the starting position ‘Shack’ of the first expression will be returned.
The PATINDEX() SQL string function is used to get the starting position of the first occurrence of given pattern in a specified expression
In the following examples, ‘%’ and ‘_’ wildcard characters are used to fin the position of the pattern in a given expression.
PATINDEX works just like LIKE operator but it returns the matching position.
The STUFF() SQL string function is used to place a string within another string
The following example returns a character string created by inserting a word Demo at the starting position 5 without deleting any letters from the given string ‘SQL Shack’
SELECTSTUFF('SQL Shack ',5,0,'Demo')
The following example returns a character string created by deleting 6 characters from the first string, SQL Shack, starting at position 5, at Shack, and inserting the second string ‘Demo’ at the deletion point.
SELECTSTUFF('SQL Shack ',5,6,'Demo')
The STRING_AGG() SQL string function is the an aggregate function used to compute a single result from a set of input values
The following example returns the names separated by ‘-‘ in a single result set.
Sometimes we will need to reformat numbers. This is especially true when we use calculations that have results with large numbers of decimal digits
The CONVERT() SQL string function is used to convert an expression from one data type to another data type.
The CONVERT SQL string function accepts in a style parameter which is used for formatting the SQL output
Implicit conversions do not require either the CAST function or the CONVERT function. Only explicit conversions require specification of the CAST or the CONVERT function.
When converting a value from float or numeric to an integer, the CONVERT() SQL string function will truncate the result. For other conversions, the CONVERT() SQL string function will round the result.
The CAST() SQL string function is used to convert an expression from one data type to another data type.
Convert an expression from valid date string to DateTime.
In the following example, the DateTime type is converted to another varchar type.
The STR() SQL string function converts a numeric value into a string
The following example converts an integer to character string and concatenate the value with the first string.
selectTRIM('SQL Shack Author of the year')+STR(2018)
The use of the WHERE and HAVING clauses in a SELECT statement control the subset of the output from the source tables.
Compare search conditions using comparison operators
Range search using between, and, and or clause
List search using in operator and or clause
Regular expression search is based on patterns, wildcards, and special characters. Even if you never ever write CLR, regex (as it’s also known) can be useful to you today, right now. Open a new query window in SQL Server Management Studio.
The regular expression transformation exposes the power of regular expression matching within the pipeline. One or more columns can be selected, and for each column an individual expression can be applied. The way multiple columns are handled can be set on the options page. The AND option means all columns must match, whilst the OR option means only one column has to match. If rows pass their tests then rows are passed down the successful match output. Rows that fail are directed down the alternate output
Matches a string of zero or more characters.
The following example returns the any values that matches the string ‘Kim’
This article is an effort to showcase the available SQL string functions to manipulate the raw data to make it more meaningful data-set for the data scientist to perform the data analysis using SQL Server. Hopefully these SQL string functions can save you both time and money!