## Q1: In the data science terminology, how do you call the data that you analyze?

In data science, you analyze **datasets**. Datasets consists of **cases**, which are the entities you analyze. Cases are described by their **variables**, which represent the attributes of the entities. The first important question you need to answer when you start a data science project is what exactly is your case. Is this a person, a family, an order? Then you collect all of the knowledge about each case you can get and store this information in the variables.

For more information, see the article: Introduction to data science, data understanding and preparation

## Q2: What are the data structures you use to store the datasets?

In SQL Server, you store the dataset you are analyzing in a **table**. A table can by physical or a virtual one, which is a **view**. SQL Server tables follow the relational model, meaning that they represent a set. Set theory is the basis for the relational model. Cases are **rows**, and variables **columns**.

R introduces the structure called **data frame**. A data frame is a **matrix**. The basis is linear algebra.

Python follows R and also introduces data frames. However, unlike with R, where the data frame structure is a part of the core engine, you need to use an additional library called **pandas** in Python in order to get data frames available.

For more information, see the article: Introduction to data science, data understanding and preparation

## Q3: Is there a difference between a table and a data frame?

Yes, there is an important difference. An SQL Server table represents a set, a data frame a matrix. Because the order of rows and columns in a set is not defined, you cannot refer to the data values by their **position**. In order to read a value from a table, you need to know the table name, the key of the row, and the name of the column. In an R or Python data frame, you can use numerical positional indexes to read the data values in their position.

For more information, see the article: Introduction to data science, data understanding and preparation

## Q4: Do all variables measure data values in the same way?

No, there are two basic classes of variables:

**Discrete variables**have a limited pool of possible distinct values. Discrete variables can be categorical, ordinal, or dichotomous.**Continuous variables**have an infinite pool of values. Continuous variables can be open or closed intervals or true numerical variables, or true numeris.

## Q5: How do you get an overview of a distribution of a discrete variable?

You should use **frequency tables**, or shortly **frequencies**. There can many different pieces of information in a frequency table. At minimum, there must be values and counts of those values, or an absolute frequency. You can also show the absolute percentage, the cumulative frequency, and the cumulative percent. Graphically, you use **bar charts** and **histograms**.

## Q6: Do you treat ordinal discrete variables differently than categorical?

Ordinal variables, or ordinals, have an **intrinsic order**. You need to define this order correctly to get the appropriate results when you use them for analysis.

Categorical variables are also called **nominal**, because they provide only the name for each category, and nothing more.

For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies

## Q7: How do you define the order of ordinals?

There are many ways how to define the order of ordinals correctly. In T-SQL, you can use the **CASE** expression to modify the data values. In R and Python, you have additional possibilities besides modifying the data values. In R, you define all discrete variables as **factors**, and distinct values are called **levels**. For ordinals, you also define the order of the levels. In Python, you use the **astype(‘category’)** method of a data frame to define discrete variables, and then the pandas **cat.reorder_categories()** function to define the proper order.

For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies

## Q8: How do you convert discrete variables to numerics?

If the values are ordinals, you can use the numeric **positional index** of the categories as the numerical representation of the variable. For nominals, you cannot use a number for a category, because numbers have an intrinsic order. You create a new **indicator** variable for every possible value of a nominal, showing whether the value is taken or not. Such indicators are called **dummies**.

For more information, see the article: Data science, data understanding and preparation – ordinal variables and dummies

## Q9: How do you create dummies from nominals?

In SQL Server, you can use the **CASE** expression again. In addition, since there are only two possibilities for a dummy, 0 or 1, you can also use the T-SQL **IIF()** function, which is a shortcut for CASE when you have only two possible outcomes.

In R, you can use the **dummy()** function from the **dummies** package. In Python, you can use the pandas **get_dummies()** function.

## Q10: How do you convert numerical variables to discrete?

**Binning**, or **discretization**, is the process of creating discrete variables from numerics. Note that there are many different ways of binning. Nevertheless, since numerics have an order, you typically want to preserve the order. Thus, you create ordinals from numerics.

For more information, see the article: Data science, data understanding and preparation – binning a continuous variable

## Q11: What are the different way of discretization?

There are many different ways of doing the binning. Each possibility has its own advantages and disadvantages. The most popular ones are:

**Equal width**binning, where the width of each bin (the interval of a continuous variable) is equal**Equal height**binning, where you have equal number of cases in each bin, but the width of the bins varies**Custom**binning, where you define the bins based on the content of the data, or based on the business logic

For more information, see the article: Data science, data understanding and preparation – binning a continuous variable

## Q12: What are the advantages of different ways of binning?

The most important advantages of different ways of binning are:

- For equal width binning, you
**preserve the shape**of the distribution - For equal height binning, you maximally
**preserve the information**in the variable - For custom binning, you
**follow the logic of real life**

For more information, see the article: Data science, data understanding and preparation – binning a continuous variable

## Q13: How do you do the discretization?

In T-SQL, you use the **CASE** expression for equal width and custom binning. You can use the **NTILE()** window function for equal height binning.

In Python, you use the pandas **cut()** function for equal width and custom binning. For equal height binning, you can use the **qcut()** function.

In R, you can use the **cut()** function from the base installation for equal width and custom binning. For equal height binning, you can search for a function is some additional package. Alternatively, it is quite simple to write **your own function**.

## Q14: Is there a measure for the amount of information in a variable?

Yes, the measure for the information is the **entropy**. Entropy is defined in the information theory branch of applied mathematics. Basically, the information is the same thing as the surprise. If you already know a piece of information, you are not surprised when somebody tells it to you or when you read or see it again. Entropy is a quantified measure for the information.

For more information, see the article: Data science, data understanding and preparation – entropy of a discrete variable

## Q15: Which variable has a higher entropy, a constant or an equal height binned one?

Entropy is also uncertainty. The more uncertainty, the higher possible surprise. There is no uncertainty in a variable that occupies only one value for all cases, for a **constant**. Entropy of a constant is zero. Constants are not useful for analyses. On the other hand, variables with equal number of cases in each class have the maximal possible uncertainty for a specific number of distinct classes.

## Q16: Which variable has a higher entropy, a numeric or a discrete one?

Numerical variables can have more information than discrete ones, although this depends on the distribution. The maximal possible entropy of a discrete variable is when you have equal height distribution. The maximal possible entropy of a continuous variable for a given variance is when the distribution is normal. With more distinct classes of a discrete variable, the maximal possible entropy raises. Therefore, when you discretize a variable, you are losing some entropy.

## Q17: How do you calculate the entropy of a discrete variable?

In T-SQL, you need to do your own calculation, there is no function out of the box. You need the **aggregate functions**, **window aggregate functions**, and the **LOG()** function.

In Python, you can use the **stats.entropy()** function from the **scipy** library. However, you need to calculate counts in advance.

In R, you can use the **Entropy()** function from the **DescTools** package. This function also expects the counts as the inputs.

## Q18: Are R and Python data frame positional indexes the same?

No, there is a slight difference when you refer to the data in R or in python Pandas data frame. Python index is zero-based, R one-based. When you define an interval by an index, in Python the interval does not include the upper bound, while in R does. Therefore, TM[3:6, 1:4] returns the same rows and columns as TM.iloc[2:6, 0:4] in Python

For more information, see the article: Data science, data understanding and preparation – basic work with datasets

## Q19: What do you use in T-SQL to read the data?

In T-SQL, you use, or course, the mighty **SELECT** statement to read the data. The core elements are:

- The
**FROM**clause, where you define the source tables and**how**to join them - The
**SELECT**clause, where you define the columns, or the**projection**, and the computed columns - The
**WHERE**clause, where you define the**filters**for the rows - The
**ORDER BY**clause, where you define the**order**of the rows returned

For more information, see the article: Data science, data understanding and preparation – basic work with datasets

## Q20: What are the basic operations on data frames in R?

In R, here are some basic operations on data frames:

- Retrieve the values by the index
**positions**for rows and columns - Retrieve the values by row and column
**names** - Join two data frames with the
**merge()**function **Bind**data frames**by columns****Bind**data frames**by rows**

For more information, see the article: Data science, data understanding and preparation – basic work with datasets

## Q21: What are the basic operations on data frames in Python?

In pandas, there are many functions that help you operating on data frames. The most basic manipulations you can do include:

- Creating projections by using a subset of columns
- Retrieve the values by the index
**positions**for rows and columns - Locate the values by a
**predicate**for the rows and columns - Join two data frames with the pandas
**merge()**function - Reorder a data frame with the
**sort()**method

## Q22: How do you aggregate data in T-SQL?

If you need to aggregate the data in T-SQL, you use the **aggregate functions**. There are the basic aggregate functions available, including some statistical functions; however, the set of aggregate functions is quite limited.

If you need to perform the aggregations in groups, you use the **GROUP BY** clause. If you need to filter on the aggregated values, you use the **HAVING** clause. You can create multiple groupings in a single statement with the **GROUPING SETS** clause.

For more information, see the article: Data science, data understanding and preparation – grouping and aggregating data

## Q23: How do you aggregate data in R?

In R, you can start with the **summary()** function from the base installation. It gives you a quick overview of a variable distribution with descriptive statistics. It returns multiple measures. There are many individual aggregates and statistical functions already in the base package. Of course, you get countless more with additional packages. You can use the **aggregate()** function from the base installation to do the aggregations in groups.

For more information, see the article: Data science, data understanding and preparation – grouping and aggregating data

## Q24: How do you aggregate data in Python?

In Python, you use the pandas aggregations on the data frames. A pandas data frame has the **describe()** method, which gives you an overview of a distribution with descriptive statistics similarly to the R summary() function. There are many more pandas data frame methods for calculating individual descriptive statistics values. You can use the **groupby()** data frame method to calculate the aggregations in groups.

For more information, see the article: Data science, data understanding and preparation – grouping and aggregating data

## Table of contents

### Dejan Sarka

View all posts by Dejan Sarka

#### Latest posts by Dejan Sarka (see all)

- Data Science in SQL Server: Unpivoting Data - October 29, 2018
- Data science in SQL Server: Data analysis and transformation – Using SQL pivot and transpose - October 11, 2018
- Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II - September 28, 2018