Dejan Sarka

Data science, data understanding and preparation – ordinal variables and dummies

March 29, 2018 by

In my previous article, Introduction to data science, data understanding and preparation, I showed how to make an overview of a distribution of a discrete variable. I analyzed the NumberCarsOwned variable from the dbo.vTargetMail view that you can find in the AdventureWorksDW2016 demo database. The graphs I created in R and Python and the histogram created with T-SQL were all very nice. Now let me try to create a histogram for another variable from that view, for the Education variable. I am starting with R, as you can see from the following code.

You can see the result in the following figure.

This histogram does not look good. The distribution is somehow weird; you would expect that the distribution would resemble the Gaussian curve. What is the problem? Note the labels – the bars are organized alphabetically. However, Education is not a pure nominal variable; it is ordinal, just like NumberCarsOwned one. It has an intrinsic order. The intrinsic order of the NumberCarsOwned variable is the same as the order of the values of that variable. With Education, you need to define the order in some way, so the values get ordered properly.

I will demonstrate a couple of techniques for defining the order of ordinal variables in T-SQL, R, and Python. In addition, I will show in this article another useful data preparation technique for working with discrete variables – creating dummy variables, or dummies, from a discrete one.

Ordering ordinals

If you want to change the order of the values of a variable in T-SQL, you need to change the values themselves. You can do it with the CASE clause in the SELECT statement you use to read the data. Of course, you can also permanently update the values in a table, and even add a new column with the updated values. The following query shows how you can use the CASE clause to modify the values in a query and get a proper histogram.

You can see the results in the next figure. You see that the order is now correct and that the histogram looks more like the bell curve.

In R, you don’t need to change the values of an ordinal to get the correct order. R defines discrete variables as so-called factors. Factors have levels, which represent the distinct values of the variable. You define the factors with the factor() function. When you define the levels, you also define whether the levels are ordered or not. If the levels are ordered, it is important to write them in the correct order. The following code shows how to use the factor() function to define the levels and their order for the Education variable, and then plots the histogram for this variable again.

You can see the results in the following figure. Now the histogram looks like it should look from the beginning.

Of course, the next is Python. In Python, you can map values to properly ordered values by using a dictionary object. You use the old values for the keys of the dictionary and the properly ordered values for the values of the key-value pairs of the dictionary. The following code reads the data and creates the mapping dictionary object.

Then you replace the original values with the mapped values. You create a graph with the mapped values. However, you use the dictionary keys, the old values, for the labels. You can see the process in the following code.

And here is the result.

Making dummies

Sometimes you want to convert a categorical variable to a set of indicators that just show the presence or the absence of the values of the variable. You create a new indicator for each distinct value or level of the original variable. You assign the value 1 to the indicator for a specific value when the original variable takes that specific value and the value 0 for each other value of the original variable. Such indicators are also called dummy variables, or dummies. There are some algorithms, particularly when you perform regression analysis, that needs only numerical input.

You can easily understand the process of creating the indicators through the following T-SQL code that uses the IIF function.

I ordered the results by the NEWID() function values just to quickly mix the rows, so I could get all possible distinct values of the Education variable in first ten rows. Here are the results.

In R, there is a very convenient way to create dummies with the help of the dummies package, the dummy function. The following code installs the package and creates the dummies for the Education variable. For the sake of brevity, I am not showing the results of the code.

In Python, again you do not need to install any additional package. You already have the get_dummies() method in the pandas package. The following code shows how to create the dummies in Python. I don’t show the results of this code as well, as they are the same as in T-SQL and R.

Conclusion

You can already see that the proper data preparation is not that simple as it might look at the first glimpse. And I am not done even with the discrete variables yet. In the next article, I will show some possibilities for discretization, for transforming a continuous variable to a discrete one.

Previous article in this series:

Next article in this series:

Downloads

References


Dejan Sarka

Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications.Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.
Dejan Sarka
Python, T-SQL

About Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications. Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.

76 Views