Dejan Sarka

Dejan Sarka

Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications.Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.
Dejan Sarka

Data science, data understanding and preparation – binning a continuous variable

April 23, 2018 by

I started to explain the data preparation part of a data science project with discrete variables. As you should know by now, discrete variables can be categorical or ordinal. For ordinal, you have to define the order either through the values of the variable or inform about the order the R or the Python execution engine. Let me start this article with Python code that shows another way how to define the order of the Education variable from the dbo.vTargetMail view from the AdventureWorksDW2016 demo database.

Read more »

Data science, data understanding and preparation – ordinal variables and dummies

March 29, 2018 by

In my previous article, Introduction to data science, data understanding and preparation, I showed how to make an overview of a distribution of a discrete variable. I analyzed the NumberCarsOwned variable from the dbo.vTargetMail view that you can find in the AdventureWorksDW2016 demo database. The graphs I created in R and Python and the histogram created with T-SQL were all very nice. Now let me try to create a histogram for another variable from that view, for the Education variable. I am starting with R, as you can see from the following code.

Read more »

Introduction to data science, data understanding and preparation

March 14, 2018 by

Data science, machine learning, data mining, advanced analytics, or however you want to name it, is a hot topic these days. Many people would like to start some project in this area. However, very soon after the start you realize you have a huge problem: your data. Your data might come from your line of business applications, data warehouses, or even external sources. Typically, it is not prepared for applying advanced analytical algorithms on it straight out of the source. In addition, you have to understand your data thoroughly, otherwise you might feed the algorithms with inappropriate variables. Soon you learn the fact that is well known to seasoned data scientists: you spend around 70-80% of the time dedicated to a data science project on data preparation and understanding.

Read more »