Univariate Analysis — Uncovering the Intuition behind Analysis Techniques

Esqin Kazimov
Mar 7, 2021

Overview

Univariate data analysis is the simplest form of data analysis. As the name suggests, it deals with a single variable; it does not look for cause and effect or for relationships between variables. Its purpose is to provide summary statistics on one variable at a time. Skipping a proper univariate analysis wastes resources later, because the data may be skewed, contain outliers, have too many missing values, or hold inconsistent values. In this article, I will give you detailed information about the properties of the analysis techniques that can be used during any univariate analysis.

1. 1-D Scatter Plot

During data exploration, we can create 2D scatter plots (for two variables), 3D scatter plots (for three variables), pair plots (for up to roughly ten variables, since a pairplot becomes less useful as the number of variables grows) and, for more than ten variables, plots derived from dimensionality reduction techniques such as PCA, t-SNE, etc. What if we want to create a 1D scatter plot for one variable? Are there any problems associated with 1-D scatter plots?

Let's create a 1D scatter plot to see its properties and decide whether it is useful for univariate analysis. First, let's import the main libraries that are needed in almost every data exploration project:

# Imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")
# Set plot style
sns.set_style("whitegrid")

Then, let’s create our main list that we will work on throughout the article:

# Create a list
main_list = [1, 5, 5, 10, 11, 11, 14, 16, 17, 17, 18, 20, 21, 22, 22, 24, 25, 25, 27, 28, 29, 30, 31, 31, 32, 33, 33, 35, 38, 40, 41, 41, 45, 47, 49, 51]

Now it is time to visualize a 1D scatter plot of the list values:

# Create 1-D scatter plot of the list values
plt.plot(main_list, np.zeros_like(main_list), 'o')
plt.show()

This is how the 1D scatter plot looks:

1-D Scatter Plot

Immediately, one notices that points far from the center are easy to distinguish, but if our points were continuous, the regions closer to the center would blur together. There is another problem with this plot: our list contains overlapping points, and on a 1D scatter plot it is impossible to see how many points sit on top of each other. So we can conclude that there are two problems associated with 1D scatter plots:

  1. If points are continuous, they become harder to distinguish.
  2. If there are overlapping points, it is impossible to detect them by looking at the plot.

What these problems tell us is that we need the points to be better separated; but why do we need this? When we do a univariate analysis of a variable, we are aiming to find interesting details that can be derived from the distribution of the points. So which visualisation techniques show the distributions of points? They include techniques that show counts of points (histograms), techniques that show the density of points (PDF, CDF), and combinations of both (box plots, violin plots).

2. Histograms

To understand histograms, let's group all elements of the list into 50 bins, since our list contains numbers between 1 and 51. The bin width will be 1, meaning that each number between 1 and 51 will have its own place on the X axis, while the Y axis will show how many times each number occurs in the main list. Now, let's write the code to create this histogram:

# Create histogram
plt.figure(figsize = (10, 5))
sns.distplot(main_list, kde = False, bins = 50)
plt.show()

This is our histogram:

Histogram (bins = 50)

Immediately we notice the difference between the 1D scatter plot and the histogram. The histogram makes much more sense: we clearly see the count of each element in the list and the distribution of those points. By reducing the number of bins, we can make the distribution more readable, so let's look at the code for a histogram that groups all the elements of the list into 5 bins:

# Create histogram of elements in main_list with 5 bins
plt.figure(figsize = (10, 5))
sns.distplot(main_list, bins = 5, kde = False)
plt.show()

This is how our new histogram looks:

Histogram (bins = 5)

After reducing the number of bins to 5, we see that the list elements are almost normally distributed. We can gain all the necessary insights from histograms by reducing or increasing the number of bins. To conclude, a histogram is a better visualisation method than a 1D scatter plot for univariate analysis.

3. Probability Density Function (PDF) and Cumulative Distribution Function (CDF)

The probability density function (PDF) is a statistical expression that defines the probability distribution of a continuous random variable. When the PDF is graphed, the area under the curve over an interval gives the probability that the variable falls within that interval. The cumulative distribution function (CDF) of a real-valued random variable X, evaluated at x, is the probability that X will take a value less than or equal to x.

Let's understand these two terms with a simple example. Suppose you take 100 people as a sample, with ages ranging between 10 and 20. By plotting a PDF we can see, for example, what percentage of this sample has an age between 15 and 18. By plotting a CDF we can see what percentage of the sample is aged 15 or younger.
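
As a minimal sketch of this example (with a hypothetical random sample standing in for the 100 people):

# Hypothetical sample: 100 people aged between 10 and 20
ages = np.random.randint(10, 21, size=100)
# PDF-style question: what fraction of the sample is aged between 15 and 18?
print("Fraction aged 15-18 :", np.mean((ages >= 15) & (ages <= 18)))
# CDF-style question: what fraction of the sample is aged 15 or younger?
print("Fraction aged <= 15 :", np.mean(ages <= 15))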

Now, let’s try to create PDF and CDF based on the values of our main list and then extract some insights from these plots.

From the histogram we created last, we can see that the bin width is 10. The bin width is computed using the following formula:

bin width = (max value - min value) / number of bins = (51 - 1) / 5 = 10

Based on the min and max values of the list and the bin width, we can also find the bin edges manually. For example, the minimum number in our list is 1 and the bin width is 10, so the next bin edge will be (1 + 10) = 11. We can also find how many elements belong to each bin.
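
Here is a minimal manual sketch of that computation (assuming 5 bins):

# A quick manual check of the bin-width formula and bin edges
bin_width = (max(main_list) - min(main_list)) / 5   # (51 - 1) / 5 = 10
bin_edges = [min(main_list) + i * bin_width for i in range(6)]
print("bin width :", bin_width)
print("bin edges :", bin_edges)   # [1.0, 11.0, 21.0, 31.0, 41.0, 51.0]

Now, let's write the code that will automatically calculate the edges of each bin and show the number of elements in each bin: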

# Use np.histogram to compute bin edges and
# element counts for each bin
counts,bins = np.histogram(main_list, bins=5)
print("bin edges :",bins)
print("element counts per each bin :",counts)

Those are our bin edges and element counts for each bin:

bin edges : [ 1. 11. 21. 31. 41. 51.]
element counts per each bin : [ 4 8 10 8 6]

If we look at the documentation of the np.histogram() function, which we used earlier to calculate the bin edges and element counts, it has a density parameter that expresses the counts in each bin as a probability density. First let's understand the density=True parameter; then it will be easy to plot the PDF and CDF.

count, bins = np.histogram(main_list, bins=5, density=True)
print("bin edges :",bins)
print("densities for each bin with density=True:",count)

Here is the output:

bin edges : [ 1. 11. 21. 31. 41. 51.]
densities for each bin with density=True: [0.01111111 0.02222222 0.02777778 0.02222222 0.01666667]

As you see, using density=True causes the element counts in each bin to be expressed as a probability density. We can also calculate this density manually:

# n is the number of elements in each bin
n = counts
# db is the difference between bin edges
db = np.array(np.diff(bins))
print("Manual calculated densities for each bin", n/db/n.sum())

The result:

Manually calculated densities for each bin [0.01111111 0.02222222 0.02777778 0.02222222 0.01666667]

As you see, it is the same result as before. Note that the documentation says that with density=True the sum of the densities will not be equal to 1 (it is the integral, density times bin width, that equals 1). To make the values themselves sum to 1, we just need some simple logic:

print(counts/sum(counts))

Those are the final densities for each bin:

[0.11111111 0.22222222 0.27777778 0.22222222 0.16666667]

As we already know how the densities are computed, let's see the remaining steps for creating the PDF and CDF:

# Plot PDF (Probability Density Function) and CDF (Cumulative Distribution Function)
counts, bin_edges = np.histogram(main_list, bins=5, density = True)
pdf = counts/(sum(counts))
print("Densities for each bin", pdf)
print("Bin edges", bin_edges)
# The CDF is the cumulative sum of the PDF
cdf = np.cumsum(pdf)
plt.figure(figsize=(15, 7.5))
sns.set_style('whitegrid')
plt.title('PDF and CDF')
plt.ylabel('Densities')
plt.xlabel('Values')
plt.plot(bin_edges[1:], pdf, label = 'PDF')
plt.plot(bin_edges[1:], cdf, label = 'CDF')
plt.legend(loc = 5, prop = {'size':20})
plt.show()

The following plot shows our PDF and CDF:

Now, based on our PDF and CDF, it is easy to get insights. For example, based on the CDF, we can say that about 83% of the values are less than 41. Another example: based on the PDF, the probability that a value falls between 21 and 31 is about 0.278.
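
As a quick sanity check, here is a minimal sketch that verifies both readings against the raw data:

# Fraction of values below 41 (matches the CDF at the bin edge 41)
print(np.mean(np.array(main_list) < 41))
# Fraction of values in the [21, 31) bin (matches the PDF for that bin)
print(np.mean((np.array(main_list) >= 21) & (np.array(main_list) < 31)))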

4. Kernel Density Estimation

If you've noticed, I used kde=False when writing the code for the histograms, because the KDE changes the scale of the y axis. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.

To create the kernel density estimate, first a normal (Gaussian) kernel is placed on each of the data points, then those kernels are summed. Let's write code to see those normal kernels based on the points of our list.

First, we need to create a dataframe based on the points in the list and then normalise it:

# Make a copy of the main list
copy_list = main_list.copy()
# Create a dataframe based on the list values
copy_df = pd.DataFrame(copy_list)
# Normalise the values using scikit-learn's StandardScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
copy_df = sc.fit_transform(copy_df)

Now, we need to set the bandwidth of the KDE, then create a kernel for each point and plot those kernels:

from scipy import stats
plt.figure(figsize=(20,10))
# Set the bandwidth using the normal reference rule: 1.06 * sigma * n^(-1/5)
bandwidth = 1.06 * copy_df.std() * copy_df.size ** (-1 / 5.)
# Grid of x values over which each kernel is evaluated
support = np.linspace(-4, 4, 200)
# Iterate through the data points, create a normal kernel
# centered on each one, and plot it
kernels = []
for i in copy_df:
    kernel = stats.norm(i, bandwidth).pdf(support)
    kernels.append(kernel)
    plt.plot(support, kernel, color="r")
sns.rugplot(copy_df, color=".2", linewidth=3)
sns.distplot(copy_df, bins = 51, kde = False)
plt.show()

This is the result:

As we said earlier, after creating the normal kernels (the red lines), they have to be summed to form the kernel density estimate. Let's integrate those kernels along the given axis using the composite trapezoidal rule and create the actual KDE plot:

# Sum the kernels, then normalise by integrating along the support
# axis with the composite trapezoidal rule, and plot the KDE
from scipy.integrate import trapz
density = np.sum(kernels, axis=0)
density /= trapz(density, support)
plt.plot(support, density)
plt.show()

As a result, the following is our KDE plot after integrating all the normal kernels:

Now let's plot the KDE using the built-in "distplot" function and see whether it matches our manually created KDE plot:

sns.distplot(copy_df, kde = True, hist = False)

And the following is the KDE plot created using the built-in method:

As you see, it is the same as the one we created from scratch. So far we have understood how kernel density estimation works internally. Now let's see the KDE plot of the original list elements without normalisation; for this we just need to set kde=True in the code we used for the histograms. The new code looks like this:

# Create histogram of elements in main_list with 50 bins
plt.figure(figsize = (10, 5))
sns.distplot(main_list, kde = True, hist= True, bins = 50)
plt.show()

As a result, this is the KDE of our elements in the original list:

Since KDE plots are a non-parametric and smoother way of estimating PDFs, you don't lose information by squashing ranges of values into bins. You can choose a coarser bin count, overlay a KDE plot on the histogram, and still observe all the relevant information about the distribution of the variable.
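
As a quick illustration (a minimal sketch reusing the same distplot call as before), we can overlay the KDE on a coarse 5-bin histogram:

# Overlay a KDE curve on a 5-bin histogram
plt.figure(figsize = (10, 5))
sns.distplot(main_list, bins = 5, kde = True)
plt.show()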

5. Statistical Analysis using Mean, Variance, STD, Median, MAD, Percentiles, Quantiles, and IQR

5.1 Mean

The mean is simply the average of all values. It is calculated with the following formula:

mean = (x1 + x2 + ... + xn) / n

With Python we can easily find the mean. Let's find the mean of our main list:

# Mean
print(np.mean(main_list))

The mean value of our list is:

26.25

The main disadvantage of the mean is that outliers can affect the result drastically.

5.2 Variance

Variance shows the spread of values around the mean: it is the average squared distance of each point from the mean. It is calculated with the following formula:

variance = ((x1 - mean)² + (x2 - mean)² + ... + (xn - mean)²) / n
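
As a quick check of this formula, here is a minimal sketch that computes the variance by hand and compares it with NumPy:

# Manual variance: average squared distance from the mean
arr = np.array(main_list)
manual_var = np.mean((arr - np.mean(arr)) ** 2)
print("Manual variance :", manual_var)
print("NumPy variance  :", np.var(main_list))  # same value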

5.3 Standard Deviation (STD)

The standard deviation is the square root of the variance.

Let’s calculate the standard deviation of the list using Python:

# Std-deviation
print(np.std(main_list))

The standard deviation of the list values is:

12.71563997602952

We saw that the mean, variance, and standard deviation all depend on outliers. So what equivalent measures are not affected by outliers? Now I will talk about some of them.

5.4 Median Value

The median is the middle value of a sorted list. There are two cases to consider when calculating it:

Case 1: Our sorted list has an odd number of elements. For example, say our sorted list is [1,2,3,4,5,6,7], which has length 7, an odd number. To find the index of the middle value (using 1-based indexing), we use the following formula:

middle index = (n + 1) / 2

For n = 7 the index is (7 + 1) / 2 = 4, and the value at index 4 of our list is 4, so the median is 4.

Case 2: Our sorted list has an even number of elements. For example, say our sorted list is [1,2,3,4,5,6,7,8], which has length 8, an even number. In this case we have to find the indexes of the two middle values, using the following formulas:

first middle index = n / 2
second middle index = (n / 2) + 1

For n = 8 the first index is 4 and the second index is 5, giving the middle values 4 and 5. The median of a list with an even length is the simple average (mean) of its two middle values, so the median here is (4 + 5) / 2 = 4.5.
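
Here is a minimal sketch of this two-case logic (the helper manual_median is hypothetical, written only for illustration):

# Manual median implementing the two cases above
def manual_median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:                     # Case 1: odd length
        return s[(n + 1) // 2 - 1]     # (n + 1) / 2 in 1-based indexing
    first = s[n // 2 - 1]              # Case 2: n / 2 in 1-based indexing
    second = s[n // 2]                 # (n / 2) + 1 in 1-based indexing
    return (first + second) / 2
print(manual_median([1, 2, 3, 4, 5, 6, 7]))     # 4
print(manual_median([1, 2, 3, 4, 5, 6, 7, 8]))  # 4.5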

As we already know how the median is computed, let's calculate the median of our main list using Python:

# Median
print(np.median(main_list))

Here is the median value of the list:

26.0

Now, let’s understand why the median value is not corrupted by an outlier.

X1 = [1,2,3,4,5,6,7]            # Without outlier
X2 = [1,2,3,4,5,6,7,1000] # With outlier
print('Mean value of X1 is', np.mean(X1))
print('Median value of X1 is', np.median(X1))
print('*'*50)
print('Mean Value of X2 is:', np.mean(X2))
print('Median value of of X2 is', np.median(X2))

Let’s look at the results:

Mean value of X1 is 4.0
Median value of X1 is 4.0
**************************************************
Mean Value of X2 is: 128.5
Median value of of X2 is 4.5

As you can see, the mean is easily corrupted. Only if more than 50% of the points in the list are outliers will the median become corrupted as well.

5.5 Percentile

Let's say we have a sorted list like the following:

X = [ x1, x2, x3, …, x50, x51, x52, … , x97, x98, x99, x100]

For example, the 80th percentile is the value at the 80th position in this sorted list of 100 values: it tells us that 80% of the values are less than the value standing at that position. The special percentiles 25%, 50%, 75% and 100% are called quartiles.

Let's find the values from our main list that the quartiles denote:

# Quartiles
print(np.percentile(main_list, np.arange(0, 101, 25)))

Here are the values:

[ 1.  17.  26.  33.5  51.]

Let's find the value at the 90th percentile:

# 90th percentile
print(np.percentile(main_list, 90))

The value is:

43.0

This tells us that 90% of the values in the list are less than 43.

5.6 Median Absolute Deviation (MAD)

MAD is used for the same purposes as the standard deviation: it measures how far the points are from the median value. It is calculated using the following formula:

MAD = median(|xi - median(X)|)

The expression |xi - median(X)| gives the absolute deviation of each point from the median. After finding those absolute deviations, the process finishes by computing the median of those absolute deviations. That is why it is called the median absolute deviation (MAD).

As we know how the MAD is computed, let's calculate the MAD of our main list using Python:

# Median Absolute Deviation
from statsmodels import robust
print(robust.mad(main_list))

Here is the MAD of the list:

13.343419966550417
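
Note that this is larger than the raw median absolute deviation of the list, which is 9. By default, statsmodels' robust.mad rescales the raw MAD (dividing it by about 0.6745, i.e. multiplying by about 1.4826) so that it matches the standard deviation for normally distributed data. A minimal sketch of the manual computation:

# Manual MAD: median of absolute deviations from the median
arr = np.array(main_list)
raw_mad = np.median(np.abs(arr - np.median(arr)))
print("Raw MAD    :", raw_mad)           # 9.0
# Rescale to match statsmodels' default robust.mad
print("Scaled MAD :", raw_mad / 0.6745)  # ~13.34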

Now let’s see why MAD can be more useful than Std-dev:

X1 = [1,2,3,4,5,6,7,8]              # Without outlier
X2 = [1,2,3,4,5,6,7,8,1000] # With outlier
print('Std-dev of X1 is', np.std(X1))
print('MAD of X1 is', robust.mad(X1))
print("*"*50)
print('Std-dev of X2 is', np.std(X2))
print('MAD of X2 is', robust.mad(X2))

Let’s look at the results:

Std-dev of X1 is 2.29128784747792
MAD of X1 is 2.965204437011204
**************************************************
Std-dev of X2 is 312.8629250591115
MAD of X2 is 2.965204437011204

As you can see, the outlier did not affect the MAD at all, but it changed the standard deviation drastically.

5.7 Inter Quartile Range (IQR)

One more related idea is the IQR. It is simply the difference between the 75th percentile and the 25th percentile, meaning that the middle 50% of the values in the list lie in this range. The IQR is often multiplied by 1.5, then added to the 75th percentile and subtracted from the 25th percentile, to set fences for detecting outliers.
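
A minimal sketch of the IQR and the 1.5 * IQR fences:

# IQR and the 1.5 * IQR outlier fences
q25, q75 = np.percentile(main_list, [25, 75])
iqr = q75 - q25
print("IQR :", iqr)                      # 33.5 - 17.0 = 16.5
print("Lower fence :", q25 - 1.5 * iqr)  # points below are flagged as outliers
print("Upper fence :", q75 + 1.5 * iqr)  # points above are flagged as outliers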

6. Box Plots

Now let's understand box plots and whiskers, as we have covered all the statistical terms needed to read them. A box-and-whisker plot, also called a box plot, displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum.

Histograms are good for seeing visually how many points exist in each range, but we cannot read the 25% quartile or the 100% quartile from a histogram just by looking at it. We could solve this by plotting a CDF, but that takes effort to prepare. Box plots give us those quartile values visually with just one line of code. Let's create a box plot and see how useful it is:

# Create a Box-plot with whiskers
sns.boxplot(data = main_list)

This is our Box plot:

As you see, there is a blue box. The line inside the box corresponds to the 50% quartile (the median), which is approximately 26. The lower boundary of the box shows the 25% quartile and the upper boundary shows the 75% quartile of the list; the width of the box has no significance. The lines extending from the box are known as the "whiskers", which indicate variability outside the upper and lower quartiles. Each whisker extends to the lowest (highest) data point still within 1.5 IQR of the lower (upper) quartile. Points beyond the whiskers are treated as outliers.

7. Violin Plots

So far, we have seen KDE plots and box plots, each with its own advantages and disadvantages. What if we combine them? Such plots are called violin plots. A violin plot is a method of plotting numeric data: it is similar to a box plot, with rotated KDEs added on both sides to show the probability density of the data at different values. Let's create a violin plot and see how useful it is:

# Create a Violin plot
sns.violinplot(data=main_list)

Here is our Violin plot:

The black line and box in the middle of the violin are equivalent to the box plot and whiskers, so a violin plot already has a box plot inside it. The shapes on the sides are KDEs, mirrored symmetrically.

Numbers have an important story to tell. They rely on you to give them a clear and convincing voice. (Stephen Few, Data Visualisation Expert)

You can glance through my Jupyter notebook here and experiment with different approaches; if I failed to capture any useful information in my own approach, please share that in the comments too.
