Rajendra Gupta
SQL Server feature selection for Python SQL

Data Sampling with Python SQL Scripts

May 9, 2019 by

Introduction

The Python programming language is object oriented, easy to use and, widely used programming language. We can write simple Python SQL Scripts and it performs the task that is not easy to do with t-SQL.

In my previous article, Using Python SQL scripts for Importing Data from Compressed files, we explored that the combination of Python SQL Scripts and SQL Server can be an exciting combination like Peanut Butter and Chocolate ?

We might need to represent data on a different frequency and need to write t-SQL code to get data at various samples. Suppose we have data on yearly frequency. We need to represent data in monthly distribution. It is not an easy task to do with a t-SQL programming language. We can use Python SQL Scripts and use different modules to do frequency conversion. In this article, we will understand this in a combination of disk capacity data in SQL Server.

Disk space monitoring and proactively planning for space requirements are the core tasks of DBA. It is one of the essential elements of capacity planning. We must monitor the database growth over time to get requirements of disk space. Usually, we configure alerts to notify DBA in case of free disk space is less than a specified threshold. It is a good approach to maintain free space threshold (around 20% free space), but we should try to be more proactive and give the requirements to storage administrators in advance. We do not want to come in a situation where you want to increase space on a particular disk, but data store does not have sufficient free space to give it to you. We might need to do other stuff such as moving drive around data store that might require downtime.

We might need to consider many parameters for planning disk space requirements such as free space inside the data and log file, auto growth, disk space growth. In this article, we will focus on disk space growth.

We can gather disk space data using the DMV sys.dm_os_volume_stats with Join on sys.master_files. It captures the disk space for all disks consisting of database files. Usually, we do not place database files in the system drive. System administrators should maintain these drives.

We get the output in the following format.

disk space

We should capture this data in a SQL table regularly. Suppose we do not have disk space data available with us. We only have few values on which basis we want to resample this data on a daily or monthly basis. We can write SQL queries to do this task for us. It might be complicated tasks for you if you are not much familiar with t-SQL. Let’s understand the concept of data resampling in the upcoming section.

Data resampling overview

We can increase or decrease the frequency of data based on the time series and generate values for it with different methods. Suppose we have a monthly distribution of data for stock prices in share market. We want to generate samples at a weekly or daily basis. Similarly, if we have weekly data, we might wish to data resampling on a monthly or quarterly basis.

We can resample data in two ways

  • Upsampling: We increase the date-time frequency in Upsampling. Example of Upsampling is as follows.
    • Convert Yearly to Quarterly data
    • Convert Quarterly to Monthly data
    • Convert Monthly data into Weekly data
    • Convert Weekly to Daily Convert
    • Convert Daily to Hourly data
  • Downsampling: It is a reverse process to Upsampling. In this, we decrease the data-time frequency of data.
    • Convert Hourly to Daily data
    • Convert Daily to Weekly data
    • Convert Weekly data to Monthly data
    • Convert Monthly to Quarterly Convert
    • Convert Quarterly to Yearly data

In the following sections, let see how to use Python Scripts to do data resampling.

Using Python SQL scripts for data Resampling

We can use SQL Server 2017 or above to use Python with SQL Server. We do not install all features in SQL Server while doing the installation. To use Python Scripts, we can use following Prerequisites. You can refer articles in TOC at the bottom to get detailed information on this.

Prerequisites

  • SQL Server 2017 or later version
  • Install the Machine Learning Services (Python) to run the Python SQL scripts

    SQL Server feature selection for Python SQL

  • Enable external scripts enabled using sp_configure command.

    Verify 'external scripts enabled' Python SQL results

  • Restart SQL Server Service and SQL Server Launchpad service. Both of the services should be in running state to execute Python Scripts.

    Verify SQL Server service and Launchpad service for Python

  • Test Python SQL Script – We can use simple Python SQL Script to test if it is working fine or not. It would avoid later environmental issues.

Verify sample script out in Python

Once we have prepared the environment, let’s create a sample data and insert data into it.

We have the following data in the table. In this data, the first row has value for 1st April 2018 while the second row shows 28th April 2019 value.

Sample Data

Python SQL Script for Upsampling data

Suppose we want to resample this data daily. Execute the following Python Script to get Upsampling data weekly.

We get the following output of Python SQL Script. In this output, data is equally distributed into a weekly basis based on the actual values we have in the table.

Python SQL Script for Upsampling data

Similar we can run the following Python SQL Script to get data daily.

 Upsampling data example

Let’s truncate the table and insert records on a yearly basis.

We want to resample data into monthly basis, execute the following Python SQL Script.

We can see monthly distribution of data with Python Script.

 Upsampling data example

Explore a Python SQL Script

Let’s understand a Python script in detail.

Part 1: Import Python Module: in SQL Server, we can execute Python SQL Script with stored procedure sp_execute_external_script and specify Python language as a parameter.

We can use Pandas module in Python Script to resample data. We can use resample() function in Pandas module. We also need to convert data type date to datetime by the Python module datatime. We can use the resampling module on DateTime column only.

Part 2: Resample the frequency of data :

In this part, we specify the index column of the pandas data frame. Index column is the column to use in data sampling.

In the first line, we convert data type date to datetime using function to_datetime. Once data type conversion finishes, we can drop existing column and set index on new column with datetime data type.

In the next line of code, we want to resample data using monthly frequency and linear interpolation. Linear interpolation equally distributes the values among time range.

We can have the following values in the resample function.

  • W: Weekly frequency
  • M: month end frequency
  • Y: Yearly frequency
  • D: Daily frequency
  • Q: quarter end frequency

Part 3: Specify input and output in a Python SQL Query

In this part, we will specify the input (SQL query to get records from the existing table) and output to display data with an appropriate data type.

Previously we defined disk space column as an integer. In the output, we can choose an appropriate data type. In this example, we defined the data type as decimal up to 2 points.

Python SQL Script to Downscaling data

Suppose we have data in sample table daily. We want to resample data from daily to monthly using SQL Python Scripts.

It performs the aggregation on the daily data and gives us month data.

Python SQL Script to Downscaling data

Similarly, we can run the code to get aggregated data quarterly.

example of Downscaling data

If we do not have supported data types, we get the following error message.

Supported types: bit, tinyint, smallint, int, bigint, uniqueidentifier, real, float, char, varchar, nchar, nvarchar, varbinary, date, DateTime, smalldatetime.

Error of invalid data types

Conclusion

Python is simple and easy to understand programming language, and we can perform many important tasks using it. In this article, we explored the use case of Python SQL Scripts to resample data on a different frequency. I will continue exploring mode Python SQL Scripts and share my knowledge in further articles. I hope you like this article.

Rajendra Gupta
Latest posts by Rajendra Gupta (see all)
Python

About Rajendra Gupta

Hi! I am Rajendra Gupta, Database Specialist and Architect, helping organizations implement Microsoft SQL Server, Azure, Couchbase, AWS solutions fast and efficiently, fix related issues, and Performance Tuning with over 14 years of experience. I am the author of the book "DP-300 Administering Relational Database on Microsoft Azure". I published more than 650 technical articles on MSSQLTips, SQLShack, Quest, CodingSight, and SeveralNines. I am the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups. Based on my contribution to the SQL Server community, I have been recognized as the prestigious Best Author of the Year continuously in 2019, 2020, and 2021 (2nd Rank) at SQLShack and the MSSQLTIPS champions award in 2020. Personal Blog: https://www.dbblogger.com I am always interested in new challenges so if you need consulting help, reach me at rajendra.gupta16@gmail.com View all posts by Rajendra Gupta

168 Views