Microsoft Linear Regression in SQL Server

March 24, 2020 by Dinesh Asanka

In this article, we will be discussing Microsoft Linear Regression in SQL Server, the next data mining topic in our SQL Server Data Mining techniques series. Naïve Bayes, Decision Trees, Time Series, Association Rules, and Clustering are the techniques we have discussed so far.

Microsoft Linear Regression is a forecasting technique in which a dependent variable is predicted from multiple independent variables. For example, if you want to predict house prices, you need to know the number of rooms, the area of the house, and other features of the house.

This means that the linear regression model can be represented as follows:

Y = a X1 + b X2 + . . . + z Xn + C
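For the house price example above, such an equation could take a purely illustrative form like House Price = a*(Number of Rooms) + b*(Area) + C, where a and b are coefficients learned from the data and C is the intercept.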

Let us see how we can use linear regression in the Microsoft SQL Server platform. As in the previous examples, we will again be using the vTargetMail view in the AdventureWorksDW sample database.

As we did for the other data mining techniques, first, we need to create a data source and a data source view. AdventureWorksDW is chosen as the data source, and the vTargetMail view is selected for the data source view.

We choose the Microsoft Linear Regression as the data mining technique, as shown in the below screenshot.

Choosing Microsoft Linear Regression as the data mining technique.

Internally, this technique uses the Microsoft Decision Trees algorithm. Unlike decision trees, however, linear regression will have only one node, and we will verify the linear regression results against decision trees at the end of the article.

The vTargetMail view will be the Case table; let us choose the relevant attributes, as shown in the below screenshot.

Specify the Input and predicted attributes for Linear regression.

The Customer Key is chosen as the Key attribute in the above screen. In Microsoft Linear Regression, all the inputs should be numerical; text columns cannot be selected. Therefore, in the above selection, Age, BikeBuyer, HouseOwnerFlag, NumberCarsOwned, NumberChildrenAtHome, and TotalChildren are selected as input attributes. This is a major limitation of Microsoft Linear Regression that does not exist in standard linear regression techniques.

In the previous examples, we selected Bike Buyer as the predicted column. However, in Microsoft Linear Regression, we will predict YearlyIncome.

Though there are default content types, there are instances where you need to change them. Content types can be modified as shown in the following screenshot.

Modifying the Data type depending on the need.

By default, House Owner Flag is assigned the Text data type, which has to be changed to the Long data type.

In the other screens of the data mining wizard, the default settings are used. The following is the Solution Explorer for the Microsoft Linear Regression data mining technique.

Microsoft Linear Regression data mining technique.
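As a point of reference, roughly the same model can also be defined directly in DMX. The following is only a minimal sketch, assuming a hypothetical model name and LONG/DOUBLE data types; in this article, the model is built through the wizard as described above.

CREATE MINING MODEL [vTargetMail LR]  -- hypothetical model name
(
    [Customer Key]            LONG   KEY,
    [Age]                     LONG   CONTINUOUS,
    [Bike Buyer]              LONG   CONTINUOUS,
    [House Owner Flag]        LONG   CONTINUOUS,
    [Number Cars Owned]       LONG   CONTINUOUS,
    [Number Children At Home] LONG   CONTINUOUS,
    [Total Children]          LONG   CONTINUOUS,
    [Yearly Income]           DOUBLE CONTINUOUS PREDICT
)
USING Microsoft_Linear_Regression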

The next step is to process the data mining structure. There will be a warning message saying that there is no split in the decision tree. You can ignore this warning because, for linear regression, there won't be any splits in the decision tree.

After processing the data mining structure, we are now ready to view the results.

Viewing the Results

As with many of the SQL Server data mining algorithms, in linear regression we can view the dependency network, as shown in the below screenshot.

Dependency network.

The dependency network shows which attributes Yearly Income depends on most. By sliding the slider down on the left-hand side, you can find out the significance of these attributes, as we observed for Naïve Bayes and Decision Trees.

In Microsoft Linear Regression, the only other available view is the Tree View. However, as indicated before, it is a one-node tree.

Tree view in Linear Regression with one node.

From this view, you can get the linear regression equation, which is the final goal of this technique.

The following screenshot shows the linear regression equation.

Linear regression equation with different coefficients.

This is the equation, and you simply have to substitute the relevant values to predict the yearly income.

Yearly Income = 57,308.498

– 1,940.134*(Total Children-1.844)

+ 8,988.432*(Bike Buyer-0.494)

+9,468.323*(Number Cars Owned-1.503)

+416.002*(Age-50.380)

+7,818.527*(Number Children At Home-1.004)
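As a purely illustrative example, for a hypothetical customer with Total Children = 2, Bike Buyer = 1, Number Cars Owned = 2, Age = 40, and Number Children At Home = 0, the equation above works out to roughly:

Yearly Income ≈ 57,308.498 – 1,940.134*(2-1.844) + 8,988.432*(1-0.494) + 9,468.323*(2-1.503) + 416.002*(40-50.380) + 7,818.527*(0-1.004) ≈ 54,092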

Let us see how we can predict from the built model using the prediction feature.

Prediction

An important aspect of any data mining technique is to predict using the built model. Let us see how we can predict using the built Microsoft Linear Regression model. This can be done from the Mining Model Prediction tab, as shown in the below screenshot. In the following example, some values are provided for a given instance to predict the annual income.

Prediction using built Linear Regression Model.

From the result tab, the results can be viewed as shown in the below screenshot.

Predicted value and its probability.

The same results can be obtained from SQL Server Management Studio by executing a DMX query. The following screenshot shows the query and its result.

Prediction using Microsoft Linear Regression via a DMX query.
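If you want to write such a query yourself, a singleton DMX prediction query along the following lines should return a similar result. This is only a sketch: the model name [vTargetMail LR] and the input values are assumptions, so substitute the name of your own mining model.

SELECT
    Predict([Yearly Income]) AS [Predicted Yearly Income]
FROM
    [vTargetMail LR]  -- assumed mining model name
NATURAL PREDICTION JOIN
(
    SELECT
        40 AS [Age],
        1  AS [Bike Buyer],
        1  AS [House Owner Flag],
        2  AS [Number Cars Owned],
        0  AS [Number Children At Home],
        2  AS [Total Children]
) AS t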

It is important to note that even if some attribute values are not available, you can still obtain a prediction. The following screenshot shows the predicted value from the linear regression model when the number of cars is unknown.

Prediction of YearlyIncome without the Number Cars Owned parameter.

When an attribute is missing, that attribute's part of the equation is simply ignored.
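In the DMX sketch above, for example, you could simply leave [Number Cars Owned] out of the singleton SELECT; the 9,468.323*(Number Cars Owned-1.503) term would then be dropped from the calculation.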

Let us validate the equation with the Decision Tree technique.

Validation with Decision Tree

Adding another data mining technique in SQL Server is very simple. You can add another mining model on the existing attributes in the Mining Models tab.

Adding decision trees to the Mining Model.

After processing the mining structure, you will observe the decision tree, as shown below.

Decision Trees for Regression.

Let us look at the equation at the main node.

Yearly Income = 57,308.215 + 9,468.574*(Number Cars Owned-1.503) + 415.816*(Age-50.384) + 8,988.666*(Bike Buyer-0.494) + 7,817.585*(Number Children At Home-1.004) – 1,939.209*(Total Children-1.844)

You will find that it is essentially the same equation that you got from the linear regression, apart from minor rounding differences in the coefficients.

Beyond that single equation, the decision tree has the additional advantage of providing node-wise equations. In the decision tree, if you click on any node, you will find an equation, as shown in the below screenshot.

Node wise equations for Decision trees.

This means that the equation Yearly Income = 63,702.121-42,567.214*(House Owner Flag-0.733)+36,322.044*(Bike Buyer-0.500) is valid for Age >= 41 and < 49, Number Cars Owned = 2, and Total Children = 1.

The following table shows the different equations at different nodes in the tree.

Data Set: Age >= 73 and < 81 and Total Children = 3
Equation: Yearly Income = 56,936.254-4,193.080*(Bike Buyer-0.121)-20,137.503*(Number Cars Owned-1.994)-1,936.065*(Age-75.146)

Data Set: Total Children = 3 and Age = 76
Equation: Yearly Income = 58,000.000-8,884.447*(Bike Buyer-0.100)

Data Set: Age = 73 and Total Children = 3
Equation: Yearly Income = 56,998.501+4,498.500*(Bike Buyer-0.333)

Data Set: Age >= 73 and < 81 and Total Children = 2 and Number Children At Home = 3
Equation: Yearly Income = 121,037.417+2,108.061*(Age-75.667)+14,848.268*(Bike Buyer-0.333)

Data Set: Age >= 49 and < 51 and Total Children >= 4 and Number Cars Owned = 2 and Number Children At Home < 3
Equation: Yearly Income = 62,583.618-18,114.343*(Number Children At Home-1.897)+5,525.516*(Bike Buyer-0.793)+6,861.981*(Age-49.759)-14,461.923*(Total Children-4.017)

This means that decision trees can provide more fine-grained predictions, and can therefore be more accurate, than Microsoft Linear Regression.

Model Parameters

As we discussed, every data mining technique has parameters that allow you to tune it to your data and environment.

Defining Model Parameters

FORCE_REGRESSOR

The Microsoft Linear Regression algorithm detects the best-fitted attributes automatically and generates the linear equation. In doing so, it might drop some attributes. However, you can force any attribute you wish into the equation by including it in the FORCE_REGRESSOR parameter. If there are multiple attributes, you can include them all, such as {Attribute 1}, {Attribute 2}.
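For instance, to force House Owner Flag and Age into the equation, you could set the Value of the FORCE_REGRESSOR parameter in the Algorithm Parameters dialog of the mining model to something like the following (the attribute names here are only illustrative):

{House Owner Flag},{Age}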

Summary

In this article, we discussed linear regression as a forecasting technique. The Microsoft Linear Regression technique is built on decision trees, and we identified that decision trees can be used as a regression technique as well.

Table of contents

Introduction to SQL Server Data Mining
Naive Bayes Prediction in SQL Server
Microsoft Decision Trees in SQL Server
Microsoft Time Series in SQL Server
Association Rule Mining in SQL Server
Microsoft Clustering in SQL Server
Microsoft Linear Regression in SQL Server
Implement Artificial Neural Networks (ANNs) in SQL Server
Implementing Sequence Clustering in SQL Server
Measuring the Accuracy in Data Mining in SQL Server
Data Mining Query in SSIS
Text Mining in SQL Server