In this article, we will focus on SQL Machine Learning using R scripts and the use of external packages.
SQL Server revolves now no more just around a traditional database system. It extended its capability using machine learning by R, Python and Java languages.
- Python: Python is an interactive, high-level and Object-oriented programming language. You can run the Python code for complex algorithms and versatile workflows using friendly commands. You can browse SQLShack articles for more details on Python
- R language: It is a popular programming language for statistical data computing. It is widely used by data scientists, statistical and mathematical computations. You can use it on UNIX, Windows and macOS
Install R in Machine Learning for SQL Server 2019
In this article, we don’t intend to go into details about R installation. You can refer to the article, Machine Learning Services – Configuring R Services in SQL Server for detailed steps.
In the SQL Server installation, put a check for the SQL Machine Learning Service and Language Extensions for R Scripts.
You can choose to select R, Python and Java as your preferred SQL Machine Learning language.
- Machine Learning Services (In-Database): It installs the SQL Server Launchpad service to support R and Python script execution
- R: It installs the R packages, interpreter, and open-source R to execute the R scripts
- Python: It installs Microsoft Python packages, the Python 3.5 executable and default libraries
In the above screenshot, we see a Red box having shared features for the R and Python machine learning server (standalone). Do not select these shared features on the server where you installed the Machine learning Services. It is required for a standalone server where R services don’t exist.
Once the installation wizard completes, you get the following confirmation screen.
Create an environment variable for SQL Machine Learning
In the Windows Server, navigate to Control panel -> System and Security ->System ->Advanced System Settings -> Environment Variables.
Create an environment variable with the following details:
- Variable Name: MJK_CBWR
- Variable value: Auto
Click Ok and restart the Windows server.
Enable script execution using sp_configure
Connect to the SQL instance using SSMS and enable external services. It enables the interaction of SQL instance with the external SQL machine learning services for R scripts.
EXEC sp_configure 'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE
Now, restart the SQL Services. After restarting SQL services, both SQL Server service and SQL Server Launchpad service should be in running state as shown below.
Default R library
By default, R services are installed in the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\R_SERVICES where C is the root directory for your SQL instance. here, the INST1 is the SQL Server instance name.
In this path, open the library folder, and it shows the installed instance library. SQL Server uses loads the respective functions from the instance library for running the R script.
Run the following SQL script to verify the default R package library. It uses the sp_execute_external_script stored procedure and specifies R in the language parameter.
@language = N'R',
@script = N'OutputDataSet <- data.frame(.libPaths());'
WITH RESULT SETS (([DefaultLibraryName] VARCHAR(MAX) NOT NULL));
For SQL Machine learning and statistical computing, R script uses various functions and packages to process the required data. By default, SQL Server installs useful packages. To get a list of installed packages in the R services, you can run the below T-SQL.
EXECUTE sp_execute_external_script @language=N'R'
, @script = N'str(OutputDataSet); packagematrix <- installed.packages(); NameOnly <- packagematrix[,1];
OutputDataSet <- as.data.frame(NameOnly);' ,@input_data_1 = N'SELECT 1 as col'
WITH RESULT SETS ((PackageName nvarchar(250) ));
It gives you a list of 58 packages preconfigured with the SQL Server R installations.
Let’s go through the useful external R libraries.
Dplyr is useful for data manipulations. You can load data from the SQL tables or data sources to filter data or columns, sort or group data, add new variables, create new columns. Refer to the Dplyr documentation for more details.
This package is useful for data visualization purposes. You can visualize your data interactively using the graphs, charts, curves, histograms, scatter plots. You get drag and drop functionality to customize your visuals. Refer to these documents for more details.
We use the Lubridate library in R for the data wrangling. Many times, you need to deal with the data time data for time series data analysis. You can work with the date-range and time-spans in a friendly way. You can also update the components such as years, months, days, hours, minutes, and seconds with it. Refer to Lubridate documentation.
RODBC is useful for the ODBC database interface to various data sources. Refer to this documentation.
It is a widely used library for the data visualizations in R services by the data scientists. You can use this library to build graphics, plots, visual statistics. Refer to this documentation.
Data cleaning is an essential aspect of data analysis. You do not want duplicate values, duplicate columns or columns without any value for your data analysis. It contains useful functions to remove duplicates, adds tabulating functionality such as adds a new column based on the other column values. Refer to this documentation.
XLConnect, xlsx2dfs ,xlutils3 xlsx
These packages are useful to interact with Microsoft Excel files.
- XLConnect: It is an excel connection in the R scripts
- xlsx2dfs: it helps you to read and write into the excel sheets using your data frame
- xlutils3: If you need to work with multiple excel files, xlutils3 is helpful for you in extracting multiple sheets together
- xlsx: You can read, write, format the excel files with the XLSX extension
The Read Rectangular Text Data (readr) enables you to read the following files:
- Comma-separated values (CSV)
- Tab-separated values(TSV)
- delimited values
- fixed-width files(FWF)
It is also a popular package in the R scripts to perform data classifications and regressions. You can perform various activities such as bootstrapping, subsampling, Hyperparameter tuning and nested resampling with the mlr library. Refer documentation.
The magick library is useful for image processing using R scripts. You can access various file formats such as JPB, PNG, JPEG, PDF, TIFF using the magick. It gives you the ability to perform image manipulations operations such as rotate, scale, crop, trim, flip, blur.
You can refer to this documentation.
There are many packages in the R scripts that can be used as per your data and analysis requirements. You can check the package availability in your SQL R instance, using the stored procedure sp_execute_external_script. For example, if we want to check for the image processing package magick, use the following query.
@language = N'R',
@script = N'
You get the error message that there is no package called ‘magick’ for your R scripts.
To install a specific external package in R service, launch the R from the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\R_SERVICES\bin.
Now, right-click on the R and launch it an administrator. It opens the following Microsoft R Client application.
Run the following command to install the magick package in R.
It downloads the magick R package, its dependencies as shown below.
Once the external package in R is downloaded, extracted and installed, you get the following screen.
Now, you can rerun the query to check the magick package, and it gives you the following output. In this output, it returns the enabled and disabled features in the magick package.
In this article, we looked at the R services configuration in SQL Server 2019. Further, we explored useful libraries and installation methods. You can install the required package for your data analysis purpose. In the next article, we will explore the data processing using the R script in the SQL Server.
- Azure Analysis Services and Power BI Live connections - April 14, 2021
- An overview of Power BI data models - April 12, 2021
- Capturing deadlocks on AWS RDS SQL Server databases - April 7, 2021