Jeffrey Yao

How to Merge and Split CSV Files Using R in SQL Server 2016

February 21, 2017 by

Introduction

From time to time, we may encounter the following scenarios when dealing with data processing:

  1. We have two CSV files that I want to merge them based on one common column value
  2. We want to split a file vertically, for example, an employee csv file, the Salary and DOB fields need to be removed into another file, dedicated only for authorized persons.
  3. We want to split a CSV file horizontally, for example, in a sales CSV file, we want to split the file based on Store name, etc.

All this work can be done at database side. The common approach is to load the whole CSV file(s) into one or two staging tables and then do

  1. After loading two CSV files into two staging tables, use INNER/LEFT/RRIGHT JOIN on the common column to get two tables together. If only common records are needed, we use INNER JOIN, otherwise, use LEFT or RIGHT JOIN.
  2. After loading the CSV file into one staging table, select the needed column list as per requirement to split the table vertically.
  3. After loading the CSV file into one staging table, select the table with a where clause to split the table horizontally.

However, with SQL Server 2016 R integration, we can easily handle this type of work in T-SQL directly without relying on the intermediate staging table(s). This will reduces workload in creating and manipulating staging tables.

Preparing Test Data

We will prepare two short CSV files as shown below, the first is [Student.csv] file, which has 10 students.

The 2nd file is [Student_Score.csv] file, which has 8 records (missing student id 7 and 10 on purpose)

The two files are located in my local C:\Rdata\ folder

Merge Implementation

Now our requirement is to merge the two files based on [StudnetID] column, and using [Student.csv] as the primary file, meaning if a student does not have a corresponding record at [Student_Score.csv] side, we still needs this student record to appear in the merged file. We will save the new file as [Student_Merge.csv].

Here is the code to do the work (the source code can be found at the [Summary] section)

Quick Explanation:

  1. Two csv files are read into each of its corresponding variables via line 3 and 4, the file names are provided by input parameters (line 8, 9, 10) @csv_1 and @csv_2. Notice that the file path is using forward-slash (/) instead of the backward slash (\), this is because backward slash is used as escape character, so if you really want to use backward slash, you need to use double slash i.e. \\.
  2. The two variables [student] and [student_score] are merged via [merge] function by [StudentID] common field, and all [student] records will be kept there via all.X = T, here T is the short abbreviation of TRUE. (line 5)
  3. The merged result is put into variable [student_merge] (line 5) and all the records in this variable will be returned (line 7)
  4. After running the script, we can see StudentID 7 and 10 do have NULL values in [Math] and [English], this is because the original [Student_Score.csv] does not contain these two students.
  5. One thing worth mentioning is that the StudentID field in each CSV file does NOT need to be sorted. For example, if [Student.csv] has the following records

    After running the T-SQL script, we will still get the same result.

    Now we see how to merge the two csv files, next step is we can either import the merged result to a database table using INSERT … SELECT … or we can create a new CSV file as shown below.

    And we can see a new file created under C:\RData\

Quick Explanation:

  1. The code is exactly the same as previous one but we add a write.csv function on line 7
  2. This write.csv function get its file name from a variable [csv_merge], which is populated by an input parameter on line 12.

Vertical Split Implementation

Now assume, I want to split this [student_merge.csv] to [Student_split.csv] and [Student_Score_Split.csv] files with the same field names as in corresponding [Student.csv] and [Student_Score.csv].

Here is the code to do the work:

Quick Explanation:

  1. Read the [Student_Merge.csv] into variable [student_merge] (line 3)
  2. Then through subset() function, we retrieve the columns we need, column list is defined in [select] parameter, such as select = c(“StudentID”, “Name”) (line 6,7,8)
  3. For [student_score_split] variable, we do not want to contain students (like student id 7 and 10) who do not have scores, as such, we use a filter !is.na(student_merge$Math), meaning the records in variable [student_merge] whose [Math] column is not NULL (i.e. NA) (line 7)
  4. Write the two variables [student_split] and [student_score_split] to two csv files.
  5. All the csv file names are provided through store procedure’s input parameters (line 13, 14,15,16)

After executing the script, we will have two newly created files under folder C:\RData\ as shown below

We can open the two files in an editor and see the following result

Horizontal Split Implementation

Just assume we need to split [student_merge.csv] into two csv files, those with Math score >= 80 and those Math score < 80.

Here is the code

Quick Explanation:

  1. Read the [Student_Merge.csv] into variable [student_merge] (line 3)
  2. Use subset() function to filter out the records as per business requirement, i.e. student_merge$Math >= 80 and student_merge$Math < 80 and assign to each variable Student_Math_A and Student_Math_B. (line 6, 7)
  3. Export the two variables [Student_Math_A] and [Student_Math_B] to two csv files.
  4. All the csv file names are provided through store procedure’s input parameters (line 13, 14,15,16)

Now there are two new files created in C:\RData\

When we open the two files in an editor, we will see this

Summary

In this article, we see how we can manipulate a CSV file with R inside T-SQL. This can be very convenient in various file pre-processing scenarios, and no doubt greatly extend the functions of T-SQL.

The following is the complete script I used in this article.

When embedding R script inside the T-SQL, I find the safest way is to use the R core packages and their libraries, such as library(utils) and library(base). If you use some other 3rd party packages, there can be some unknown errors when embedding the R script into the T-SQL, though the R script runs OK outside of T-SQL.

There are many other file processing scenarios I have not discussed but worth some serious trials, such as file merge based on multiple columns, file splitting on complex conditions, adding a calculated column based on other columns, removing some specified records as per business requirements, updating some records or appending some records etc.

In short, with R, we can process CSV files directly which usually cannot be done with T-SQL, thus results in concise and easy-to-maintain codes.

References

The following list contains four R functions used in this article.


Jeffrey Yao
General database design, R, SQL Server 2016

About Jeffrey Yao

Jeffrey Yao is a senior SQL Server consultant with 16+ years hands-on experience, focusing on administration automation with PowerShell and C#. His current interests include: - using data warehousing technology to manage big number of SQL Server instances for capacity planning, performance forecasting, and evidence mining - doing data visualization and analysis with R - doing T-SQL puzzles He enjoys writing and sharing his knowledge View all posts by Jeffrey Yao

168 Views