Data science

Machine Learning Services – Configuring R Services in SQL Server

October 25, 2018 by Ben Richardson

The R language is one of the most popular languages for data science, machine learning, and computational statistics. Several IDEs allow seamless R development. Owing to the growing popularity of R, Microsoft has included R Services in SQL Server from SQL Server 2016 onwards. In this article, we will briefly review how to integrate R with SQL Server 2017. We will walk through the installation process and then execute some basic R commands in SQL Server 2017.

Forecast SQL backup size

October 23, 2018 by Esat Erkec

This article covers how to analyze and forecast the size of a SQL backup as a means to better manage backup retention.

One of the main tenets of database management is “Do not lose your data”. Accordingly, a database administrator bears a huge responsibility for protecting data, and taking database backups and archiving the SQL backup files is a key task that plays the leading role in any data protection strategy. Backup planning is especially significant for disaster recovery scenarios, because the backup file will be used for the restore operation after any failure or data corruption. For this reason, every DBA must prepare recovery strategies for possible disaster scenarios and ensure that these scenarios are recoverable. At the same time, the backup files must be tested for data integrity; this process makes it possible to evaluate both the recovery time and the integrity of the backup files. You can find all the details about backup and restore strategies in the article Backup and Restore (or Recovery) strategies for SQL Server database.

Data understanding and preparation – grouping and aggregating data II

September 28, 2018 by Dejan Sarka

You might find the T-SQL GROUPING SETS I described in my previous article a bit complex. However, I am not done with them yet, and I will show additional possibilities in this article. But before you give up on reading it, let me tell you that I will also show a way to make R code simpler with the help of the dplyr package. Finally, I will show some slightly more advanced aggregation techniques for Python pandas data frames.

Interview questions and answers about data science, data understanding and preparation

July 27, 2018 by Dejan Sarka

Q1: In data science terminology, what do you call the data that you analyze?

In data science, you analyze datasets. Datasets consist of cases, which are the entities you analyze. Cases are described by their variables, which represent the attributes of the entities. The first important question you need to answer when you start a data science project is what exactly your case is. Is it a person, a family, an order? Then you collect all the knowledge you can get about each case and store this information in the variables.

Data understanding and preparation – grouping and aggregating data I

July 10, 2018 by Dejan Sarka

I have already tacitly done quite a few aggregations over the whole dataset and over groups of data. Of course, the vast majority of readers here are familiar with the GROUP BY clause in the T-SQL SELECT statement and with the basic aggregate functions. Therefore, in this article, I want to show some advanced aggregation options in T-SQL, as well as grouping and aggregating data in an R or a Python data frame.

Data understanding and preparation – basic work with datasets

June 4, 2018 by Dejan Sarka

In my previous four articles, I worked on a single variable of a dataset. I showed example code in T-SQL, R, and Python, and I always used the same dataset. Therefore, you might have gotten the impression that in R and in Python you can operate on a dataset the same way you operate on a SQL Server table. However, there is a big difference between a SQL Server table and a Python or R data frame.

Data understanding and preparation – entropy of a discrete variable

May 14, 2018 by Dejan Sarka

In the conclusion of my last article, Data science, data understanding and preparation – binning a continuous variable, I wrote about preserving the information when you bin a continuous variable into bins with an equal number of cases. In this article, I explain that statement. I will show you how to calculate the information stored in a discrete variable by explaining the measure of information, namely the entropy.
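As a minimal sketch of the idea (not code from the article), the Shannon entropy of a discrete variable can be computed as H = -Σ p(x) · log2 p(x) over the relative frequencies of its values. The function name `entropy_bits` and the two sample variables below are hypothetical and serve only to illustrate why bins with an equal number of cases preserve the most information.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Return the Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the relative frequency p of each distinct value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical binned variable: four bins with an equal number of cases.
equal_bins = ["a"] * 25 + ["b"] * 25 + ["c"] * 25 + ["d"] * 25

# Hypothetical binned variable: the same four bins, but one dominates.
skewed_bins = ["a"] * 70 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10

h_equal = entropy_bits(equal_bins)    # maximum for four states: log2(4) = 2 bits
h_skewed = entropy_bits(skewed_bins)  # lower, because the distribution is uneven

print(f"equal bins:  {h_equal:.4f} bits")
print(f"skewed bins: {h_skewed:.4f} bits")
```

With four states, the entropy cannot exceed log2(4) = 2 bits, and that maximum is reached only when every bin holds the same share of the cases.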

Data understanding and preparation – binning a continuous variable

April 23, 2018 by Dejan Sarka

I started to explain the data preparation part of a data science project with discrete variables. As you should know by now, discrete variables can be categorical or ordinal. For ordinal variables, you have to define the order either through the values of the variable or by informing the R or Python execution engine about the order. Let me start this article with Python code that shows another way to define the order of the Education variable from the dbo.vTargetMail view in the AdventureWorksDW2016 demo database.
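One possible way to tell Python about the order of an ordinal variable is an ordered pandas categorical type. The sketch below is an illustration only and is not the code from the article: it uses a small hypothetical in-memory sample instead of reading dbo.vTargetMail, and the list of Education levels and their order is an assumption.

```python
import pandas as pd

# Hypothetical sample of Education values; the real data lives in dbo.vTargetMail.
education = pd.Series([
    "Bachelors", "High School", "Partial College",
    "Graduate Degree", "Partial High School", "Bachelors",
])

# Assumed order of the levels, from lowest to highest.
levels = [
    "Partial High School", "High School", "Partial College",
    "Bachelors", "Graduate Degree",
]

# An ordered categorical type makes the order explicit to pandas.
edu_type = pd.CategoricalDtype(categories=levels, ordered=True)
education_ordered = education.astype(edu_type)

# Sorting and min/max now follow the declared order, not the alphabet.
sorted_values = education_ordered.sort_values().tolist()
lowest = education_ordered.min()
highest = education_ordered.max()

print(sorted_values)
print(lowest, "<", highest)
```

Without the ordered type, sorting the strings would be alphabetical, which would place "Bachelors" before "High School".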

Data understanding and preparation – ordinal variables and dummies

March 29, 2018 by Dejan Sarka

In my previous article, Introduction to data science, data understanding and preparation, I showed how to get an overview of the distribution of a discrete variable. I analyzed the NumberCarsOwned variable from the dbo.vTargetMail view that you can find in the AdventureWorksDW2016 demo database. The graphs I created in R and Python and the histogram created with T-SQL were all very nice. Now let me try to create a histogram for another variable from that view, the Education variable. I am starting with R, as you can see from the following code.

Introduction to data science, data understanding and preparation

March 14, 2018 by Dejan Sarka

Data science, machine learning, data mining, advanced analytics, or whatever you want to call it, is a hot topic these days. Many people would like to start a project in this area. However, very soon after the start you realize you have a huge problem: your data. Your data might come from your line-of-business applications, data warehouses, or even external sources. Typically, it is not ready for advanced analytical algorithms straight out of the source. In addition, you have to understand your data thoroughly; otherwise, you might feed the algorithms with inappropriate variables. Soon you learn a fact that is well known to seasoned data scientists: you spend around 70-80% of the time dedicated to a data science project on data preparation and understanding.

10 things you need to know to become a Data Scientist

August 22, 2016 by Minette Steynberg

Introduction

If you have been browsing job ads lately, you will have noticed a huge number of positions available for Data Scientists. The demand seems to be much larger than the supply, which means that there is a huge opportunity here. However, there appears to be a catch: most of these positions require some experience or knowledge in the field of Data Science. So if you are midway through your career, how can you skill up to become a Data Scientist?
