In descriptive statistics, once the centre of a data set has been located through the mean, median and mode, one of the most important remaining tasks is to measure the variation around that centre. The variability of two data sets can differ greatly even when their centres are similar or close.
Just as central tendency can be measured in several ways, so can variability. Measures of variability are also called measures of dispersion, scatter or spread. Here we examine the most commonly used measures of statistical dispersion: range, interquartile range, variance and standard deviation. These are used for quantitative variables.
Dispersion is a different property from central tendency, but together they are the most widely used properties of a distribution.
In descriptive statistics, the range of a data set, called the sample range, is simply the difference between the maximum and minimum observed values.
It can be formally described as: “The sample range is obtained by computing the difference between the largest observed value of the variable in a data set and the smallest one”.
Range = Max − Min.
Computing the range is relatively easy. However, because only the maximum and minimum values are used and every number in between is discarded, a large portion of the information in the data is ignored. Moreover, the range can never decrease when further observations are added to the data set; it can only stay the same or increase, which makes the range very sensitive to the size of the sample.
For example, a school hosts a running competition and reports the following finishing times: 2, 6, 9, 1, 2, 3. What is the range? To calculate it, we subtract the minimum value from the maximum, giving a range of 9 − 1 = 8.
For another data set, 15, 28, 12, 8, 6, 21, 18, 1, the range is 28 − 1 = 27.
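The two range calculations can be checked with a short Python sketch. The function name `sample_range` is chosen here for illustration and is not part of any library. The last two lines illustrate the point that adding observations can leave the range unchanged or enlarge it, but never shrink it.

```python
def sample_range(values):
    """Return the largest observed value minus the smallest one."""
    values = list(values)
    if not values:
        raise ValueError("the range is undefined for an empty data set")
    return max(values) - min(values)


finishing_times = [2, 6, 9, 1, 2, 3]
other_data = [15, 28, 12, 8, 6, 21, 18, 1]

print(sample_range(finishing_times))  # 9 - 1 = 8
print(sample_range(other_data))       # 28 - 1 = 27

# A new value inside the current extremes leaves the range unchanged;
# a new value outside them enlarges it.
print(sample_range(finishing_times + [4]))   # 8
print(sample_range(finishing_times + [12]))  # 12 - 1 = 11
```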
In descriptive statistics, the interquartile range is also known as the midspread or H-spread. It is one of the important measures of statistical variability. Before giving its formal definition, however, we first need to understand what the percentiles, quartiles and deciles of a variable are. The median divides the ordered data set into two halves, the bottom 50% and the top 50%. In the same way, percentiles divide it into hundredths: the first two percentiles, P1 and P2, are the numbers that separate the lowest 1% and 2% of the data set from the top 99% and 98% respectively. The fiftieth percentile is the median. Similarly, deciles divide the data set into tenths, with D1 and D2 being the 10th and 20th percentiles, and so on.
Quartiles, however, are the most commonly used of these. As the name suggests, quartiles divide the ordered values into four parts, or quarters, and are called Q1, Q2 and Q3. Q1, the first quartile, is the value that separates the bottom 25% from the top 75%. Q2 is the median, and Q3 separates the bottom 75% from the top 25%. In simple words, the interquartile range is the difference between the upper and lower quartiles, that is, between the 75th and 25th percentiles. A box plot is used to depict the quartiles.
Because the interquartile range has a breakdown point of 25%, whereas the range has a breakdown point of 0%, it is often preferred over the range. Formally, the quartiles can be defined in the following way:
“Let n denote the number of observations in a data set. Arrange the observed values of the variable in the data set in increasing order. The first quartile Q1 is at position (n + 1)/4, the second quartile Q2 (the median) is at position (n + 1)/2, and the third quartile Q3 is at position 3(n + 1)/4 in the ordered list. If a position is not a whole number, linear interpolation is used.”
Here, because we use quartiles to depict the interquartile range, it can be defined formally as:
“The interquartile range of the variable, denoted IQR, is the difference between the first and third quartiles of the variable, that is, IQR = Q3 − Q1. Roughly speaking, the IQR gives the range of the middle 50% of the observed values.”
We can use an example to check our understanding. Suppose a running competition yields 13 finish times: 7, 7, 31, 31, 47, 75, 87, 115, 116, 119, 119, 155, 177. With n = 13, the median Q2 is at position (13 + 1)/2 = 7, so Q2 = 87. Q1 is at position (13 + 1)/4 = 3.5, halfway between the 3rd and 4th values, and Q3 is at position 3(13 + 1)/4 = 10.5, halfway between the 10th and 11th values. They are 31 and 119 respectively. The IQR is 119 − 31 = 88.
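The position rule with linear interpolation can be sketched in Python as follows. The helper names `value_at_position` and `quartiles` are illustrative, and the sketch assumes at least three observations so that every quartile position falls inside the ordered list.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in an ordered list.

    A fractional position is resolved by linear interpolation
    between the two neighbouring values.
    """
    lower = int(position)            # whole part of the position
    fraction = position - lower      # how far towards the next value
    if fraction == 0:
        return ordered[lower - 1]
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least three observations")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3


times = [7, 7, 31, 31, 47, 75, 87, 115, 116, 119, 119, 155, 177]
q1, q2, q3 = quartiles(times)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other quartile conventions exist and can give slightly different values for the same data, so the result depends on the position rule chosen.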
The smallest and largest values, together with the quartiles, give a compact picture of both the centre and the variation of the data. Arranged in ascending order, they are known as the five-number summary, written in the form: min, Q1, Q2, Q3, max.
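As a rough illustration, the five-number summary can be assembled with Python's standard `statistics` module. The function name `five_number_summary` is an assumption made for this sketch. The choice of `method="exclusive"` is also an assumption: it is the option whose interpolation is based on n + 1 positions, which matches the quartile positions used in this article.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) for a data set."""
    ordered = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


times = [7, 7, 31, 31, 47, 75, 87, 115, 116, 119, 119, 155, 177]
summary = five_number_summary(times)
print(summary)
```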
To depict these variations, a box plot is constructed from the five-number summary to portray the variation of the observed values in a data set.
A box plot depicts a data set graphically through its quartiles. It can also have lines extending from the box, called whiskers; hence it is often referred to as a box-and-whisker plot. Outliers are plotted individually. It is a graphical indicator of the dispersion of the data and of its outliers.
In descriptive statistics, another measure of variability or dispersion is the standard deviation. It is one of the most commonly used measures of variation in a sample. It is not as simple to compute as the range. It is denoted by the Greek letter sigma (σ) for a population or the Latin letter s for a sample. It can be thought of as a kind of average of the deviations of the observed values from the mean of the variable.
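Formally, it can be described as follows. For a variable x, the sample standard deviation, denoted by s_x (or, when no confusion can arise, simply by s), is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}},$$

where $n$ is the number of observations and $\bar{x}$ is the sample mean.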
The standard deviation is the square root of the variance, which is discussed below. When the mean is used as the measure of centre, the standard deviation is the natural measure of dispersion to use. It is important to note that the standard deviation is never negative; it is zero only when all observed values are equal.
Moreover, there is an empirical rule for the normal distribution and, more generally, for roughly symmetric, bell-shaped distributions. The rule relates the standard deviation to the share of observed values that lie in an interval around the mean: approximately 68% of the values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
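The empirical rule (roughly 68%, 95% and 99.7% of values within one, two and three standard deviations of the mean) can be explored with a small simulation. The sketch below is only illustrative: the mean of 50, the standard deviation of 10, the sample size and the random seed are arbitrary assumptions, and `share_within` is a name chosen for this example.

```python
import random
import statistics


def share_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


# Simulated bell-shaped data; all parameters here are arbitrary.
rng = random.Random(0)
sample = [rng.gauss(50, 10) for _ in range(10_000)]

for k in (1, 2, 3):
    print(k, round(share_within(sample, k), 3))
```

The printed shares should land close to the rule's percentages, though they will not match them exactly because the sample is finite.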
Looking at the formula for the standard deviation, we can see that we need to add up the squares of the deviations of each observed value from the mean.
This is called the sum of squared deviations. Dividing it by n − 1 gives the sample variance, and taking the square root of the variance gives the standard deviation. For a random variable, the variance is the expectation of the squared deviation from the mean. Variance has a central role in statistics and is also used in the many sciences that rely on statistical analysis of data.
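The steps just described (sum the squared deviations, divide by n − 1, take the square root) can be written out directly. This is a minimal sketch; the function names are illustrative, and the result is compared with the standard library's `statistics.variance` and `statistics.stdev`, which also use the n − 1 divisor.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    sum_of_squared_deviations = sum((x - mean) ** 2 for x in values)
    return sum_of_squared_deviations / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


times = [2, 6, 9, 1, 2, 3]  # finishing times from the range example
print(sample_variance(times), statistics.variance(times))
print(sample_std(times), statistics.stdev(times))
```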
There are other measures of dispersion that are used in descriptive statistics albeit sparingly.
Mean Absolute Difference: This measure of dispersion is the average absolute difference of two independent values drawn from a probability distribution. It is also referred to as the Gini mean difference, often denoted GMD, or the absolute mean difference, denoted MD.
Median Absolute Deviation: It is often denoted MAD. For quantitative data, it is the median of the absolute deviations of the values from the data's median. The term can also refer to the population parameter that the sample MAD estimates. It is more robust to outliers than the standard deviation, because a small number of extreme values has little effect on it.
Average Absolute Deviation: This measure averages the absolute deviations of the values from a central point. It is a summary statistic of dispersion. To compute it, both the measure of deviation and the measure of central tendency used as the central point must be specified.
Distance Correlation: This is used in probability theory to measure the dependence between two random variables. It is zero if and only if the variables are independent.
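The first three measures in this list can be sketched for a sample as follows. The function names are illustrative. Two choices are assumptions of this sketch rather than fixed conventions: the mean absolute difference is averaged over all pairs of distinct observations, and the average absolute deviation uses the mean as its central point unless another one is supplied. The value 60 appended at the end is a made-up outlier.

```python
import statistics
from itertools import combinations


def mean_absolute_difference(values):
    """Average |a - b| over all pairs of distinct observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def median_absolute_deviation(values):
    """Median of the absolute deviations from the data's median."""
    centre = statistics.median(values)
    return statistics.median(abs(x - centre) for x in values)


def average_absolute_deviation(values, centre=None):
    """Mean absolute deviation from a central point (the mean by default)."""
    if centre is None:
        centre = statistics.mean(values)
    return sum(abs(x - centre) for x in values) / len(values)


times = [2, 6, 9, 1, 2, 3]
print(mean_absolute_difference(times))
print(median_absolute_deviation(times))
print(average_absolute_deviation(times))

# One extreme value moves the MAD far less than the standard deviation.
with_outlier = times + [60]
print(median_absolute_deviation(with_outlier), statistics.stdev(with_outlier))
```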
Alternative computing formulas for the standard deviation help reduce the arithmetic in hand calculation when the numbers involved are awkward, with multiple decimal places. It follows from the previous discussion that the greater the variation among the observed values, the larger the standard deviation will be. Finally, the standard deviation fulfils the fundamental idea of a measure of dispersion and hence is, along with the range, the most commonly used one. However, like the mean, it is sensitive to extreme values, so a few outliers can inflate it considerably.
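The article does not spell out which alternative formulas it has in mind. One widely used computing formula, shown here as an assumption, avoids calculating each deviation from the mean by working with the sum of the values and the sum of their squares: s² = (Σx² − (Σx)²/n) / (n − 1). The sketch below compares it with the defining formula; the function names are illustrative.

```python
import math


def std_defining(values):
    """Standard deviation from the squared deviations around the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def std_computing(values):
    """Standard deviation from the sum of values and the sum of squares."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


times = [2, 6, 9, 1, 2, 3]
print(std_defining(times), std_computing(times))
```

The shortcut is convenient by hand, but in floating-point arithmetic it can lose precision when the values are large relative to their spread, so the defining formula is usually the safer choice in code.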