Describing a single variable and drawing inferences about it is relatively easy. However, the majority of statistical research involves two or more variables. Data consisting of paired observed values of two variables are called bivariate data. Each of the two variables may be qualitative or quantitative: both can be qualitative, both can be quantitative, or there can be one of each. We evaluate these possibilities below:
Observed values of two qualitative variables give us bivariate qualitative data.
When there is a single qualitative variable, its frequency distribution, i.e., the number of times each category occurs, is shown in a frequency table. When there are two qualitative variables, a two-way frequency table is used to show their joint distribution. Such tables are also called contingency tables.
The two variables are referred to as the row variable and the column variable. The categories (sometimes called classes) of the row variable are listed in the left margin, those of the column variable in the upper margin, and the cells record the frequencies. Summarizing two-variable data in a two-way frequency table is also called cross-tabulation.
Frequency tables come in various sizes. The simplest is a 2 x 2 table, for two qualitative variables with two categories each. Here, the first number gives the number of rows and the second number gives the number of columns. We can also have 2 x 3, 3 x 3, and 3 x 4 tables, depending on the number of categories of the respective variables.
Because calculation by hand can become difficult and exhausting, we use SPSS to process the data. SPSS is a widely used program for statistical analysis, popular among social science researchers, health researchers, surveyors, data miners, and market researchers. In descriptive statistics it is used for ratio statistics, cross-tabulation, and frequencies. In bivariate statistics it supports means, t-tests, ANOVA, correlation, Bayesian statistics, etc. It also performs linear regression and various simulations.
We can illustrate this with a well-known example that was worked in SPSS.
Let the blood types and genders of 40 persons be as follows: (O,Male), (O,Female), (A,Female), (B,Male), (A,Female), (O,Female), (A,Male), (A,Male), (A,Female), (O,Male), (B,Male), (O,Male), (B,Female), (O,Male), (O,Male), (A,Female), (O,Male), (O,Male), (A,Female), (A,Female), (A,Male), (A,Male), (AB,Female), (A,Female), (B,Female), (A,Male), (A,Female), (O,Male), (O,Male), (A,Female), (O,Male), (O,Female), (A,Female), (A,Male), (A,Male), (O,Male), (A,Male), (Female), (Female), (Ambala).
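For readers without SPSS, the same kind of cross-tabulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSS procedure used in the text: it takes just the first ten pairs from the list above, treats blood type as the row variable and gender as the column variable, and the names pairs, cell_counts, and table are chosen here for convenience.

```python
from collections import Counter

# First ten (blood type, gender) pairs from the list above; an illustrative
# subset, not the full sample of 40.
pairs = [
    ("O", "Male"), ("O", "Female"), ("A", "Female"), ("B", "Male"),
    ("A", "Female"), ("O", "Female"), ("A", "Male"), ("A", "Male"),
    ("A", "Female"), ("O", "Male"),
]

# Count how often each (row category, column category) combination occurs.
cell_counts = Counter(pairs)

# Assumption: blood type is the row variable, gender the column variable.
row_categories = sorted({blood for blood, _ in pairs})
col_categories = sorted({gender for _, gender in pairs})

# Two-way frequency table as a nested dict: table[row][column] -> frequency.
# A Counter returns 0 for combinations that never occur.
table = {
    r: {c: cell_counts[(r, c)] for c in col_categories}
    for r in row_categories
}

# Print the table with the column categories in the upper margin
# and the row categories in the left margin.
print(f"{'':<6}" + "".join(f"{c:>8}" for c in col_categories))
for r in row_categories:
    print(f"{r:<6}" + "".join(f"{table[r][c]:>8}" for c in col_categories))
```

In this ten-person subset, blood type A occurs with 3 females and 2 males, B with 1 male only, and O with 2 females and 2 males, giving a 3 x 2 table.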
Summarizing data in a two-way frequency table by using SPSS:
Suppose the row variable has I categories and the column variable, also qualitative, has J categories. Their joint distribution can then be shown in an I x J table. Let n be the total number of individuals observed (the sample size), and let the cell in row i and column j occur $f_{ij}$ times. Its relative frequency is then:
(frequency in the (i, j)th cell) / (total number of observations) = $f_{ij} / n$
Multiplying this relative frequency by 100 gives its percentage.
The column and row totals are calculated with these formulas:
$f_{\cdot j} = f_{1j} + f_{2j} + f_{3j} + \dots + f_{Ij}$, for the column total, and
$f_{i\cdot} = f_{i1} + f_{i2} + f_{i3} + \dots + f_{iJ}$, for the row total.
On the basis of the row and column totals, we can compute the relative frequencies within each row and within each column with these formulas:
$f_{ij} / f_{i\cdot}$ and $f_{ij} / f_{\cdot j}$ respectively.
These give us the conditional distributions for the rows and for the columns.
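The totals, relative frequencies, and conditional distributions described above can be worked through on a small table. The following Python sketch assumes a hypothetical 2 x 2 table of counts (the numbers are invented for illustration and are not the blood type data); the variable names are likewise chosen here.

```python
# Hypothetical 2 x 2 table of cell frequencies (rows: blood type,
# columns: gender). These counts are invented for illustration.
table = {
    "A": {"Female": 6, "Male": 4},
    "O": {"Female": 3, "Male": 7},
}
rows = list(table)
cols = list(table[rows[0]])

# Row totals (sum across each row) and column totals (sum down each column).
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

# Sample size n is the sum of all cell frequencies.
n = sum(row_totals.values())

# Joint relative frequencies: cell frequency divided by n.
joint = {r: {c: table[r][c] / n for c in cols} for r in rows}

# Conditional distributions: cell frequency divided by its row total,
# and cell frequency divided by its column total.
row_conditional = {r: {c: table[r][c] / row_totals[r] for c in cols} for r in rows}
col_conditional = {c: {r: table[r][c] / col_totals[c] for r in rows} for c in cols}

# Multiply by 100 to report percentages.
for r in rows:
    for c in cols:
        print(f"{r}, {c}: joint {100 * joint[r][c]:.1f}%, "
              f"within row {100 * row_conditional[r][c]:.1f}%, "
              f"within column {100 * col_conditional[c][r]:.1f}%")
```

Each conditional distribution sums to 1 within its own row or column, while the joint relative frequencies sum to 1 over the whole table.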
We can use the same gender and blood type example to depict this further via SPSS.
In this example, we computed the conditional distributions given one particular value of the column variable (column percentages) and the conditional distributions given one particular value of the row variable (row percentages). This raises two questions: which of the conditional distributions should we use, and why did we calculate them in the first place?
The fundamental reason is that if there is any kind of association between the two variables, we can find it out from the conditional distributions. If the conditional distribution of one variable differs across the categories of the other, i.e., the percentages change from row to row (or from column to column), the two variables are associated.
Whether the association is direct or not is also shown by the shapes of the conditional distributions. If the percentages are similar across all categories of the other variable, the two variables are evidently independent.
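Comparing conditional distributions can be made concrete with a short Python sketch. It is a rough illustration only: both tables are hypothetical, the helper names row_percentages and largest_gap are invented here, and the "largest gap" summary is just one simple way to compare percentages, not a formal test of independence.

```python
def row_percentages(table):
    """Conditional distribution of the column variable within each row, in percent."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * count / total for col, count in cells.items()}
    return result


def largest_gap(table):
    """Largest difference, in percentage points, between the rows'
    percentages for the same column category."""
    pct = row_percentages(table)
    columns = list(next(iter(table.values())))
    return max(
        max(pct[r][c] for r in pct) - min(pct[r][c] for r in pct)
        for c in columns
    )


# Hypothetical table whose rows have identical percentages (50% / 50%).
similar = {
    "A": {"Female": 10, "Male": 10},
    "O": {"Female": 20, "Male": 20},
}

# Hypothetical table whose rows have very different percentages.
different = {
    "A": {"Female": 15, "Male": 5},
    "O": {"Female": 10, "Male": 30},
}

print(row_percentages(similar), largest_gap(similar))
print(row_percentages(different), largest_gap(different))
```

In the first table the row percentages coincide, so the gap is zero and the sample shows no association. In the second table the rows split 75/25 and 25/75, a gap of 50 percentage points, which suggests an association. How large a gap counts as meaningful is left open by this sketch.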
Which variable's conditional distributions we should use for inference depends on which variable is the explanatory variable and which is the response variable.
Formally, they can be defined as: “A response variable measures an outcome of a study. An explanatory variable attempts to explain the observed outcomes.”
Sometimes it is difficult to tell which variable is which, and we end up using the percentages to identify them. However, we need to note that an association between two variables does not necessarily imply that one variable causes changes in the other. Association does not imply causation.
The best way to represent bivariate qualitative data graphically is with bar graphs, which can be either clustered or stacked.
Other graphical representations, such as pie charts, can be used as well.
The gender and blood type example above can be depicted graphically as shown below:
Figure: Blood type and gender depicted graphically via a stacked bar graph and a pie chart, respectively.
As discussed above, in some cases one variable is qualitative and the other is quantitative. Here too, we can use a two-way frequency table to discover whether an association exists. As before, we first need to group the values of the quantitative variable into categories and then use the two-way frequency table to show the joint distribution.
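The grouping step can be sketched in Python as follows. Everything in this example is an assumption made for illustration: the burger types, the prices, the single cut-off of 5.00, and the helper name price_class are all hypothetical and do not come from the data behind the text.

```python
from collections import Counter

# Hypothetical (burger type, price) observations; not data from the text.
burgers = [
    ("Beef", 4.50), ("Beef", 6.00), ("Beef", 7.25),
    ("Chicken", 3.75), ("Chicken", 4.25), ("Chicken", 6.50),
    ("Veggie", 3.50), ("Veggie", 4.00),
]


def price_class(price):
    """Group a quantitative price into a category.
    The cut-off of 5.00 is an arbitrary assumption."""
    return "under 5" if price < 5.00 else "5 or more"


# After grouping, both variables are categorical and can be cross-tabulated.
counts = Counter((kind, price_class(price)) for kind, price in burgers)
types = sorted({kind for kind, _ in burgers})
classes = ["under 5", "5 or more"]

# Two-way frequency table: table[type][price class] -> frequency.
table = {t: {c: counts[(t, c)] for c in classes} for t in types}

for t in types:
    print(t, table[t])
```

The choice of class boundaries affects the resulting table, so in practice the cut-offs should be chosen to suit the data rather than copied from this sketch.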
In the same way as above, we then find the conditional distributions, on the basis of which we draw our inferences. They can be depicted graphically, as in the example below:
Example: Types and Prices of burgers
Clustered bar graph for the types and prices of burgers
When both the row and column variables are quantitative, the approach described above can still be used to identify whether any association exists between the variables. In the case of one quantitative and one qualitative variable, the quantitative variable was first grouped into categories. In this case, both variables are grouped into categories and then presented in a two-way frequency table to show their joint distribution. However, this is not the ideal way to represent the relationship between the variables. A scatterplot gives the best picture of the correlation and association between the variables.
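A scatterplot is usually paired with a numerical summary of the correlation. The text does not say which measure it has in mind; the Python sketch below assumes Pearson's correlation coefficient and computes it by hand on hypothetical paired data. Drawing the scatterplot itself would need a plotting library, so only the coefficient is shown, and the function name pearson_r is chosen here.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations.
    Assumes equal-length lists in which neither variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and sums of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired quantitative observations (x_i, y_i).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

For these invented values the coefficient is about 0.77, a fairly strong positive linear association. A value near 0 would suggest little linear association, although a scatterplot may still reveal a non-linear pattern that the coefficient misses.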