Matija Lah

String Concatenation Done Right – Part 1 – Dubious Practices

April 15, 2015 by

Having to represent sets of data as strings is a very common requirement in information management, even in modern times where a variety of more or less elaborate standards for storing, and moving, data are at our disposal. For instance, XML, JSON, or similar techniques, allow the data to be extracted from one data source, using a well-known standard, and be stored temporarily until being loaded into a destination data store, or until being consumed in some other way. Actually, both XML as well as JSON might even be used as a standard way of storing data permanently; especially, if the consumers expect the data to use one or the other format.

Depending on the client application, the data retrieved from a database management system, can be transformed in different ways. To some consumers (or destinations) the metadata at the origin is important (for instance, in Microsoft Excel it might be crucial to preserve the information that a particular value is a string representation of a numerical value), whereas to some client applications the metadata as set at the origin is irrelevant (for instance, when an HTML document is rendered, practically all of its contents are treated as strings of characters, regardless of the actual data type or domain used at the source).

What exactly am I talking about? Imagine a database about books and their authors (such as the pubs sample database that came with SQL Server 2000). A book can be written by one or more authors, and each author can write one or more books. If you wanted to create a list of books containing their attributes, including the list of authors as a single table, you could concatenate the full names of the authors into a single string and treat them as a single attribute, as shown in Figure 1 below.

Figure 1: Titles and authors (from the pubs sample database).

Typically, this transformation from metadata-rich data to raw data would be performed by the client application; however, there are cases where the client application is incapable of performing such transformations (for instance, when the client application expects a string representation of date/time values in accordance with a specific standard, different from the one used by the data source), or a client application might not even exist (for instance, when data is exported from a data source, and it is not possible to determine in advance what the metadata requirements of the data destination will be).

Primarily, SQL Server provides a few standard ways to access data. By using the Transact-SQL querying language, the results of the queries can be consumed by a variety of data providers – either as row sets (by using the basic SELECT statement), or even as XML (by using the SELECT statement with the FOR XML directive). It is, however, possible to extend the built-in capabilities with custom programmatic logic.

In this three-part article I will present two reliable, and efficient, techniques for representing data sets as delimited strings. Both techniques will use native SQL Server capabilities that have been available since SQL Server 2005; however, for the purposes of this particular article I will be using SQL Server 2012, which will allow me to significantly simplify one of the techniques.

Now, before you learn about the two good options, I have to point out a couple of inappropriate ones.

Dubious Practices

In Transact-SQL it is possible to use (actually, I should say misuse) a native data retrieval method in order to concatenate a set of values into a single string value. The technique is referred to as variable assignment using the SELECT statement; it relies on the way data is consumed and assigned to a variable when the SELECT statement is executed.

The SELECT statement can be used to assign values to T-SQL variables, as shown in Figure 2 below.

Figure 2: The SELECT statement can be used to assign values to variables.

Note that the above query does not use any restrictions; therefore multiple object_id values will be retrieved, and assigned to the @var variable. Because the variable can only hold a single value, after the rows have been processed only a single assignment will prevail. Relational theory defines the set as an unordered collection of elements; therefore the only way to guarantee order in a retrieval query is to use the ORDER BY clause. The above query does not specify order; therefore the database engine is free to choose any order it deems appropriate to retrieve the data, and perform the assignment. As a consequence, it is impossible to predict which object_id value will be assigned to the variable when the query execution is completed.

The result of a SELECT statement is a set; that is, zero, one or more, values, depending on the source data and the restrictions used in the query. The source of the values can be a column, a variable, an expression, or even a subquery. The query in Figure 3 uses an expression to assign values to the variable, and this expression combines the previously assigned value with the value retrieved from the column.

Figure 3: Expressions can be used to assign values to variables.

In this particular case, where the expression is an addition of numerical values, the order of the assignments is irrelevant; however, in the end the variable will hold a sum of all object_id values.

The query shown in Figure 4 can be used to retrieve all object_id values concatenated into a single string.

Figure 4: Variable assignment using the SELECT statement could, theoretically, be used for string concatenation.

However, this particular syntax still does not guarantee the expected results. The behaviour of variable assignments using the SELECT statement has been a subject of many discussions in the past, and a comprehensive explanation is also available in the Microsoft Knowledge Base. I urge you to read through the article entitled “PRB: Execution Plan and Results of Aggregate Concatenation Queries Depend upon Expression Location”, available at http://support2.microsoft.com/default.aspx?scid=287515, where the problems with the above technique are discussed in more detail.

Most of all, I urge you not to use the variable assignment technique to create string representations of data sets for a very simple reason: the behaviour of the above query is undefined.

With the introduction of XML in SQL Server 2005, a significantly more appropriate alternative has become available. As you surely know, XML has been a native data type since SQL Server 2005; it is a complex data type – supported by a set of built-in retrieval and manipulation methods – that even comes with its own querying language: the XPath expressions and the XML Query (XQuery).

SQL Server also supports XML composition, the ability to create XML documents using T-SQL, which can be used to concatenate a set of values and represent them as a single value – an XML document. For instance, by using an XML composition expression in a nested SELECT statement, it is possible to create a delimited string containing multiple object_id values, as shown in Figure 5 below.

Figure 5: A simple XML composition – simple, yet unsafe.

If the TYPE option is omitted from the FOR XML directive, the composed XML data is returned as text (NVARCHAR(MAX), to be exact). Therefore, while this may seem like the perfect solution, treating XML as if it were just a string of characters might not be a very good idea, as demonstrated in Figure 6 below.

Figure 6: XML is not just a string.

In the result of the query above, the greater-than sign (>) is returned as an HTML entity instead of the literal value. This allows the resulting string to be considered a well-formed representation of an XML document, so that any standard XML parser can convert it to an actual XML document or XML fragment. But would a human have expected to see this?

In part two I will demonstrate how to correctly utilize XML composition to represent data sets as delimited strings. I would now kindly ask you to forget both techniques shown above as soon as possible.

See more

Consider these free tools for SQL Server that improve database developer productivity.


Matija Lah

Matija Lah

Matija Lah, formally a university graduate of law, has been involved in information management since the nineties.

His first introduction to SQL Server in IUS Software d.o.o. Slovenija has later culminated in a new career developing data-centric solutions and consulting. His contributions to the SQL Server community have led to the Microsoft Most Valuable Professional award in 2007 (Windows Server System SQL Server).

Currently, most of Matija's time is spent on projects involving advanced information management, natural language processing, and knowledge management.

View all posts by Matija Lah
Matija Lah
String Concatenation

About Matija Lah

Matija Lah, formally a university graduate of law, has been involved in information management since the nineties. His first introduction to SQL Server in IUS Software d.o.o. Slovenija has later culminated in a new career developing data-centric solutions and consulting. His contributions to the SQL Server community have led to the Microsoft Most Valuable Professional award in 2007 (Windows Server System SQL Server). Currently, most of Matija's time is spent on projects involving advanced information management, natural language processing, and knowledge management. View all posts by Matija Lah

1,317 Views