Data Interpolation and Transformation using Python in SQL Server 2017

November 21, 2017 by Prashanth Jayaram

As a continuation of my previous article, How to use Python in SQL Server 2017 to obtain advanced data analytics, a bit of curiosity about deep learning with Python integration in SQL Server led me to write this article.

With Python running within SQL Server, you can bring the existing data and the code together. Data is accessible directly, so there’s no need to extract query result sets and move data from storage to the application. It’s a useful approach, especially considering issues of data sovereignty and compliance, since the code runs within the SQL Server security boundary, triggered by a single call from a T-SQL stored procedure.

As we’re all aware, Microsoft has been taking steps that surprised many of us, one of them being the release of SQL Server 2017 on Linux. In an attempt to expand its horizons, Microsoft has brought Python capabilities into SQL Server. Python is a language that is easily learned, and it packs a lot of potential. There’s no better time to learn Python, since enterprises are already changing gears to use IT to better derive value from their business data.

Machine learning and artificial intelligence (AI) may sound intimidating, but they are actually a great value-add to organizations in areas such as web search, financial prediction, fraud detection, digitization, etc. There is a lot of hype around deep learning and AI, and debate over whether or not to use them, but one thing we can all agree upon is that analytics is of great value to businesses.

It’s really hard to say at this initial stage how well this integration with SQL Server will work, or how well SQL Server can withstand data science design and methodology, but it sure is an interesting feature and a highly rated data science capability to be tested in SQL Server 2017.

The available options made me execute several Python samples and create a few data mining examples to understand the core concepts of data analytics. As I mentioned in my previous article, How to use Python in SQL Server 2017 to obtain advanced data analytics, it’s all about data loading and data transformation. If the data set is huge, one may have a hard time loading it and transferring it over the network. Hence, data scientists often do their predictive analysis on a sample of the data.

If you are a database administrator interested in leveraging data science, your first question would be, “Where do I start and how?”

Where to begin

As a database administrator, the concepts of data science may seem overwhelming. Worry not; we’ll start small, with simple examples. Let’s pick interpolation. Interpolation is like filling in the blanks in a series: we look at the data surrounding the blank and predict what the right value to fill in might be. This requires some meaningful analysis of the context of the data.
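
As a minimal illustration in plain pandas (outside SQL Server for a moment), a linear interpolation fills a gap in a numeric series from its neighbours; the values here are made up for the example:

import pandas as pd

# A gap in a numeric series, filled linearly from its neighbours
s = pd.Series([100, 120, None, 160])
print(s.interpolate())   # the missing value becomes 140.0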

In this article, we’re going to try some interpolation and transformation operations using Python, which covers:

  • Demonstration of the execution of a Python script in SQL Server
  • Importing modules and loading data into the dataset using the Python script
  • Data aggregation using Python modules
  • Working with JSON files
  • Pivoting SQL data
  • And more….

Getting started

In this article, we’re going to use a SQL table called tbl_loanprediction. Let’s say that this table already has some data in it.

We first need to run some data munging operations on it. Data munging is the process of exploring the data, identifying issues with it, and fixing them before the data can be used in a model. Next, we need to validate and build relationships between attributes.

Let us start the demo by meeting the prerequisites of data cleansing and loading.

Data loading

Create the table tbl_loanprediction
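
A minimal sketch of the table definition, assuming only the columns referenced later in this article (the full Loan Prediction data set carries more):

CREATE TABLE dbo.tbl_loanprediction
(
    Loan_ID         VARCHAR(20),
    Gender          VARCHAR(10),
    Married         VARCHAR(5),
    Education       VARCHAR(20),
    Self_Employed   VARCHAR(5),
    ApplicantIncome INT,
    LoanAmount      FLOAT   -- nullable; some rows have missing values
);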


Copy sample data into the CSV file

Let’s create a CSV file with the name loanprediction.csv and add the following content to it. Let’s place the file at E:\loanprediction.csv.
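
A few illustrative rows matching the assumed schema above (the data is invented for this sketch); note the deliberately blank LoanAmount in the first data row, which gives the later interpolation steps something to fill in:

Loan_ID,Gender,Married,Education,Self_Employed,ApplicantIncome,LoanAmount
LP001002,Male,No,Graduate,No,5849,
LP001003,Male,Yes,Graduate,No,4583,128
LP001005,Female,Yes,Not Graduate,Yes,3000,66
LP001006,Female,No,Not Graduate,No,2583,120
LP001008,Female,Yes,Not Graduate,No,6000,141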

Loading CSV file into the SQL Table

Using BULK INSERT, load the data into the tbl_loanprediction table
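
A sketch of the load, assuming a comma-delimited file with a header row:

BULK INSERT dbo.tbl_loanprediction
FROM 'E:\loanprediction.csv'
WITH
(
    FIRSTROW = 2,           -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);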


Querying the CSV file using PANDAS

PANDAS is an open-source, high-performance data analysis library for Python.
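
A minimal sketch of calling PANDAS from T-SQL through sp_execute_external_script, reading the CSV file directly from the path assumed above:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
# Read the CSV file directly with pandas and return it as the result set
df = pd.read_csv("E:\\loanprediction.csv")
OutputDataSet = df
';

OutputDataSet is the default name of the data frame that sp_execute_external_script returns to SQL Server as a result set.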


Data extraction using logical condition

Let us now add some logical conditions and extract some data.
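
For instance, a sketch that keeps only the self-employed applicants with a loan amount above 100 (the threshold is arbitrary, for illustration):

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
df = pd.read_csv("E:\\loanprediction.csv")
# Boolean masks combined with & express the logical condition
OutputDataSet = df[(df["Self_Employed"] == "Yes") & (df["LoanAmount"] > 100)]
';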


Data loading into the Temp table for further processing

In this script, we will add the data to a new table. We will use the INSERT and EXEC commands together to fetch the values from the data frame and send them to the new table.

We will use an additional parameter, @output_data_1_name, to name the Python data frame whose contents are returned to SQL Server and consumed by the INSERT command to populate the newly created table.

You may want to filter values of a column based on conditions on another set of columns. For instance, we want a list of all females who are not graduates and are married:
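
A sketch of the whole pattern; the temp table name and column list are illustrative, while @output_data_1_name and the INSERT … EXEC pairing come from the discussion above:

CREATE TABLE #tbl_filtered
(
    Loan_ID   VARCHAR(20),
    Gender    VARCHAR(10),
    Married   VARCHAR(5),
    Education VARCHAR(20)
);

INSERT INTO #tbl_filtered
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# Filter the input rows; df_out is handed back via @output_data_1_name
df_out = InputDataSet[(InputDataSet["Gender"] == "Female")
                      & (InputDataSet["Education"] == "Not Graduate")
                      & (InputDataSet["Married"] == "Yes")]
',
    @input_data_1 = N'SELECT Loan_ID, Gender, Married, Education FROM dbo.tbl_loanprediction',
    @output_data_1_name = N'df_out';

SELECT * FROM #tbl_filtered;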


Pivoting

PANDAS can be used to create MS Excel-style pivot tables. For instance, in this case, a key column is “LoanAmount”, which has missing values. We can impute them using the mean amount of each “Gender”, “Married” and “Self_Employed” group. The mean “LoanAmount” of each group can be determined as:
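
A sketch using the pandas pivot_table method, with the column names assumed as above:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# Mean LoanAmount for every Gender/Married/Self_Employed combination;
# reset_index() flattens the grouped result into a plain data frame
impute_grps = InputDataSet.pivot_table(values="LoanAmount",
                                       index=["Gender", "Married", "Self_Employed"],
                                       aggfunc="mean")
OutputDataSet = impute_grps.reset_index()
',
    @input_data_1 = N'SELECT Gender, Married, Self_Employed, LoanAmount FROM dbo.tbl_loanprediction'
WITH RESULT SETS ((Gender VARCHAR(10), Married VARCHAR(5), Self_Employed VARCHAR(5), MeanLoanAmount FLOAT));

Each row of the result is a group mean that can be used to fill in a missing “LoanAmount” for a matching applicant.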


Let us see another example of pivoting. In the example below, the input is a table. Let us now attempt pivoting by transforming the data stored in the table.
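
As a sketch, the same pivot_table call can reshape the long table into a Gender-by-Education matrix of mean loan amounts (the two Education values, “Graduate” and “Not Graduate”, are assumptions about the data):

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# Rows become Gender, columns become the distinct Education values
pivoted = InputDataSet.pivot_table(values="LoanAmount",
                                   index="Gender",
                                   columns="Education",
                                   aggfunc="mean").reset_index()
OutputDataSet = pivoted
',
    @input_data_1 = N'SELECT Gender, Education, LoanAmount FROM dbo.tbl_loanprediction'
WITH RESULT SETS ((Gender VARCHAR(10), Graduate FLOAT, [Not Graduate] FLOAT));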


Data extraction from JSON file

OK, so that was a CSV. What if we wanted to pass in the data from a JSON file? That’s possible, too! The path to the JSON file is stored in a variable and fed as input in order to load the JSON file. This example also illustrates looping and conditional constructs.

One of the major hurdles for programmers, when they think of making the switch to Python, is that the language does not use braces to indicate blocks of code. Instead, all blocks of code are denoted by indentation, which is strictly enforced by the language. If you indent with the wrong number of spaces, the program won’t run correctly. So, you need to be very careful when writing code in Python. I recommend using comments generously.
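
A sketch along those lines, assuming a hypothetical file E:\loanprediction.json that holds an array of loan records:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import json
import pandas as pd

json_path = "E:\\loanprediction.json"   # hypothetical path, held in a variable

with open(json_path) as f:
    records = json.load(f)              # expects a JSON array of objects

rows = []
for rec in records:                     # looping construct
    if rec.get("LoanAmount") is not None:   # conditional construct: skip incomplete records
        rows.append(rec)

OutputDataSet = pd.DataFrame(rows)
';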


Wrapping up

The core of most businesses today is data: structured or unstructured. And when machine learning is used to help the business, the quality of that core data defines its success. It is always essential to understand the nature of the data, the scope and the purpose of the project, and the algorithms used. These factors determine the rate of success of the project.

The rise of data mining, use of data science, data analytics, big data and other trends are reshaping the role of a database administrator. Today’s applications are increasingly data-intensive, relying on diverse data sources, fast transaction processing, and real-time information. The role of IT is ever-evolving.

I hope you found this demonstration of using Python with SQL Server useful; it could be one of the initial steps toward using these methods for data analytics. There may be different approaches or better alternatives to PANDAS in Python. My goal in this article has been to illustrate a few examples that highlight the use of Python along with SQL Server.

Table of contents

A Quick start Guide to Managing SQL Server 2017 on CentOS/RHEL Using the SSH Protocol  
How to use Python in SQL Server 2017 to obtain advanced data analytics
Data Interpolation and Transformation using Python in SQL Server 2017
An introduction to a SQL Server 2017 graph database 
Top string functions in SQL Server 2017 
Top 8 new (or enhanced) SQL Server 2017 DMVs and DMFs for DBAs
Overview of Resumable Indexes in SQL Server 2017 
Understanding automatic tuning in SQL Server 2017
