Dejan Sarka

Data understanding and preparation – binning a continuous variable

April 23, 2018 by

I started to explain the data preparation part of a data science project with discrete variables. As you should know by now, discrete variables can be categorical or ordinal. For ordinal, you have to define the order either through the values of the variable or inform about the order the R or the Python execution engine. Let me start this article with Python code that shows another way how to define the order of the Education variable from the dbo.vTargetMail view from the AdventureWorksDW2016 demo database.

This code imports the necessary libraries, reads the data from SQL Server, defines Education as categorical, and then reorder the values using the pandas built-in function cat.reorder_categories(). Then the code shows the distribution of the values and the bar plot for this variable. For the sake of brevity, I am not showing the bar chart, because it is the same as I have shown in my previous article, Data science, data understanding and preparation – ordinal variables and dummies.

Some data science algorithms need only discrete variables for the input. If you have a continuous variable, you have to group or bin the values. Sometimes you want to bin a continuous variable for other reasons as well, for example, to show the distribution in a bar chart, or to use it for grouping and pivoting. Therefore, let me explain a couple of options you have for binning.

Equal width binning

There are many ways to do the binning. I will introduce here the three most popular ones, the equal width, equal height, and custom binning. Let me start with T-SQL code that prepares a new table with the Age variable and the key, Age lowered for 10 years, to make the data more plausible.

Here are the results.

You can see the minimal, maximal and average value of the age, the range, and the width of a bin for equal width binning. Equal width binning means that the width of each bin is equal, no matter of the number of cases in each bin. This kind of binning preserves well the distribution of the continuous variable, especially if the number of bins is high. The following Python code reads the Age from the table just created, and then bins it into 20 equal width bins, and then creates the bar chart.

So here is the bar chart. You can see the distribution of Age quite well.

If you use less bins, then the bar chart follows the shape of the distribution worse. The next Python code bins the values of the Age in 5 bins.

These are the counts, or the frequency distribution of the Age, binned in 5 equal width bins.

In R, you can use the cut() function from the basic installation, without any additional package, to bin the data. The following code loads the RODBC library, reads the SQL Server table, and then bins the Age variable in 5 equal width bins. The counts, the result of the table() function, is the same as the result from the Python code.

Finally, let’s do the equal width binning in T-SQL. The code first reads the minimal and the maximal value of Age and calculated the width of the bins, considering binning into 5 bins. Then the next query uses the CASE clause to define the bins and the labels of the bins.

Again, the distribution of the binned variable is the same as in Python or R. Here are the partial results of the query, showing a couple of rows with the original and the binned values of the Age variable.

Equal height binning

Equal height binning means that after the binning, there is a napproximately equal number of cases in each bin, and the width of the bins varies. You can do this very simple with the T_SQL NTILE() function, like the following code shows. Note that I used the ORDER BY clause only to shuffle the results, to get all possible tiles in the first few rows.

Here are partial results.

In R, I will create a function for the equal height binning. Let me develop the function step by step. First, I use the integer division to calculate the minimal number of cases in each bin for 5 bins.

The number given is 3,696. If the number of cases would be divisible by 5, then each bin would have this number of cases. Let’s create a vector with five times this number.

However, the number of cases in this data frame we are using is not divisible by 5. With the modulo operator, you can see that there is a remainder of 4 rows.

We will add these four cases to the first four bins, like does the NTILE() function in T-SQL. The following code creates a vector of five values, namely (1, 1, 1, 1, 0), which will be added to the number of cases in each bin.

Finally, here is the code that creates the binning function and does the binning.

The table() function shows the number of cases in each bin, 3697, 3697, 3697, 3697, and 3696.

In Python, you can use the pandas qcut() function for the equal height, or quantile binning, like the following code shows.

Note the results – the distribution of the values slightly differs from the results of the T-SQL and R code.

These small differences are due to the fact that the qcut() function classifies all cases with the same value in a single tile. In Python, all cases with Age of 25 are in the first tile; in T-SQL and R, some of these cases were assigned to tile 2, in order to have a really equal number of cases in each tile. Nevertheless, these differences are small and not really important.

Custom binning

If you want to follow the real life or business logic, you need to do a custom binning. The Age variable is a very nice example. Age is many times binned with a custom logic. You can imagine that one-year difference in age means much more when you are a teenager than when you are in sixties or seventies. Therefore, you create narrower bins for the smaller values and wider for the bigger values of the Age variable. Here is the T-SQL code that does the custom binning.

You can do it easily with the pandas cut() function in Python as well. You only need to provide the function a vector with the cutting values.

Here is the graph showing the distribution of the Age binned in custom bins.

Finally, let’s do the custom binning in R. I am using the cut() function again and feed it with a vector of the cutting points, like in Python.

And here are the counts of the binned variable.

Conclusion

Equal width and custom binning are both quite intuitive. You can ask yourself why you would use equal height binning. Of course, there is a reason for this kind of binning as well. You preserve more information with equal height binning than with other two options. I will explain this in detail in my next article.

Table of contents

Introduction to data science, data understanding and preparation
Data science, data understanding and preparation – ordinal variables and dummies
Data science, data understanding and preparation – binning a continuous variable
Data science, data understanding and preparation – entropy of a discrete variable
Data science, data understanding and preparation – basic work with datasets
Data science, data understanding and preparation – grouping and aggregating data I
Data science, data understanding and preparation – grouping and aggregating data II
Interview questions and answers about data science, data understanding and preparation

References


See more

100% FREE SQL tools for SQL coding, refactoring, productivity, formatting, plan analysis, instance discovery, multi-db script propagation, database text and object search, object decryption, SQL CI/CD/DLM/DevOps, and SQL script comparison.


Dejan Sarka

Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications.Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.

View all posts by Dejan Sarka
Dejan Sarka
Data science

About Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications. Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight. View all posts by Dejan Sarka

116 Views