# 401077 Biostatistics: Analysis of Data Set Assignment 2 Answer

401077 Introduction to Biostatistics, Autumn 2019

Assignment 2

Please answer each question in the template document provided and submit via Turnitin on or before the due date. The marks allocated to each question are shown in the assignment. A total of 30 marks are available and this assignment is worth 30% of your overall grade.

All of the questions in this assignment ask you to analyse the data set assigned to you for assignments. This is the same data set which you used for Assignment 1. Read ‘Description of your data set.docx’ for the descriptions of the variables.

This assignment is assessing your skills, not the skills of the computer. You will need to include graphs from R Commander into your assignment but all other R Commander output will attract 0 marks and is discouraged. It is your task to identify the relevant results in the R Commander output and write these up in your assignment.

Some of the assignment questions ask you to show working or justify your answer. Answers without these requested workings or justifications will be awarded 0 marks.

Question 1 (12 marks)

Research question: Does average self-reported weekly income differ between male and female full-time workers in Sydney?

In the following analyses, use R Commander and the assignment data set assigned to you. The variables you will use are ‘sex’, ‘income’ and ‘log_income’. Each student will get different answers as the data sets differ.

a) Using R Commander, graph the distributions of income for males and females separately. Write a sentence describing the shape of these distributions. Repeat for the logarithm transformed values of income (variable: log_income). (2 marks)

b) The research question could be addressed by conducting either an independent samples t-test on the difference in mean ‘income’ between genders or an independent samples t-test on the difference in mean ‘log_income’ between genders. Which of these two alternatives would you choose? Justify your choice. (2 marks)

c) Using R Commander, implement the analysis you chose in b). Present your results using the 5-step method. (4 marks)

d) The logarithm transformation changes the numeric values of the data but does not change the order of data values. Therefore, a non-parametric hypothesis test (which ranks the data) will give the same answer whether conducted on ‘income’ or ‘log_income’. Complete an appropriate non-parametric test to address the research question. Please use R Commander to do all calculations but present your answer following the 5-step method. (4 marks)

Question 2 (8 marks)

Research question: What is the average self-reported hours worked by full-time workers in Sydney?

In the following analyses, use R Commander and the assignment data set assigned to you. The variable you will use is ‘work’. Each student will get different answers as the data sets differ.

a) Using R Commander, calculate the mean and standard deviation of self-reported hours worked for your sample of full-time workers in Sydney. (1 mark)

b) Using R Commander and the assignment data set assigned to you, calculate a 95% confidence interval for mean self-reported work hours for full-time workers in Sydney. (Don’t forget to check any assumptions.) (1 mark)

c) Write a sentence which answers the research question above as fully as possible. (2 marks)

d) Calculate the margin of error from the confidence interval in b). (1 mark)

e) Using information from the assignment data set assigned to you, estimate the minimum sample size required to produce a 95% confidence interval for mean hours worked by full-time workers in Sydney which has a margin of error of 0.5 hours. Present your answer as a sentence which summarises the required sample size and under what conditions. (3 marks)

Question 3 (10 marks)

Research question: Does highest educational qualification differ by gender among full-time workers in Sydney?

In the following analyses, use R Commander and the assignment data set assigned to you. The variables you will use are ‘educ’ and ‘sex’. Each student will get different answers as the data sets differ.

a) Investigate the relationship between highest education qualification obtained and gender in the assignment data set assigned to you, using a two-way contingency table. Include either row or column percentages and describe the observed relationship in a sentence or two. Obtain the results using R Commander but then type and label the table yourself with appropriate description and headings in your answers. (2 marks)

b) Address the research question using an appropriate hypothesis test on the assignment data set assigned to you. Please use R Commander for all calculations but format your answer following the 5-step method. (4 marks)

c) In the assignment data set assigned to you, what proportion of men have postgraduate degrees? What proportion of women have postgraduate degrees? (1 mark)

d) Suppose we were interested in determining whether or not there was a difference in the proportion of males and proportion of females who have postgraduate qualifications. What is the minimum sample size required to detect a difference between two population proportions, one of which is 0.10 and the other 0.15, with 80% power at the α = 0.05 significance level assuming there will be equal numbers of males and females? Present your answer as a sentence which presents the required sample size and under what conditions. (3 marks)

“When submitting your assignment to Turnitin you are implicitly ticking these statements:

1.  I retain a backup file of this assignment in case the original file is lost or damaged.
2. I hereby certify that no part of this assignment or product has been copied from any other student’s work or from any other source except where due acknowledgement is made in the assignment.
3.  I hereby certify that no part of this assignment or product has been submitted by me in another (previous or current) assessment.
4.  I hereby certify that no part of the assignment has been written or produced by any person
5. I hereby certify that no part of this assignment has been made available to any other student.
6.  I am aware that this work will be reproduced and submitted to plagiarism detection software for the purpose of detecting possible plagiarism. This software may retain a copy of this assignment on its database for future plagiarism detection.
7. I understand that failure to uphold this declaration may result in academic proceedings in line with the UWS Student Academic Misconduct Policy.”

Question 1

a) Density plots of income by gender and of log(income) by gender are displayed in Figure 1 and Figure 2.

Figure 1: Density plot of income by gender

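Table 1: Skewness and kurtosis of the distributions shown in the density plots

| | Income: male | Income: female | log(income): male | log(income): female |
|---|---|---|---|---|
| Kurtosis | 6.6 | 1.7 | 2.3 | 3.013 |
| Skewness | 10.4 | 2.4 | 0.13 | 0.08 |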

Figure 1 shows that the distributions of income for males and females are similar in shape: both have a very short left tail and a very long right tail, so both are strongly right-skewed.

Figure 2: Density plot of log(income) by gender

Figure 2 shows that the distributions of log(income) for males and females are more symmetric and closer to a normal distribution.
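The effect of the log transformation on a right-skewed variable can be illustrated with a short Python sketch. The incomes below are hypothetical values chosen to have a long right tail (they are not the assignment data), and `skewness` is a simple moment-based estimator; statistical packages may apply small-sample corrections, so their values can differ slightly.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical weekly incomes (NOT the assignment data): most values are
# moderate and a few are very large, which produces a long right tail.
income = [600, 750, 800, 900, 950, 1000, 1100, 1200, 1500, 2500, 6000]
log_income = [math.log(v) for v in income]

skew_raw = skewness(income)
skew_log = skewness(log_income)
print(f"skewness of income:      {skew_raw:.2f}")
print(f"skewness of log(income): {skew_log:.2f}")
```

For data of this shape the skewness of the logged values is smaller than that of the raw values, which is the pattern the two density plots describe.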

b) Part (a) suggests that the distribution of income is not symmetric, whereas the distribution of log(income) is more symmetric and closer to a normal distribution. Because the t-test assumes approximately normally distributed data in each group, it is best to run an independent samples t-test on the difference in mean log(income) between genders.
c) Step 1: Set up hypotheses and determine level of significance

Null hypothesis: mean log(income) is equal for males and females

Alternative hypothesis: mean log(income) is different for males and females

We will use the usual significance level

α = 0.05

Step 2: Select the appropriate test statistic

The appropriate test statistic for this hypothesis is the t-statistic (two-sample t-test).

Step 3: Set up the decision criteria

We will reject the null hypothesis if the p-value is less than 0.05.

Step 4: Compute the test statistic

The test statistic from R Commander is t = 0.2553 and the p-value is 0.7986.

Step 5: Conclusion

The p-value is 0.7986, which is greater than 0.05. Therefore, we cannot reject the null hypothesis: there is insufficient evidence that mean log(income) differs between male and female full-time workers.

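d) Non-parametric tests do not assume that our data follow a specific distribution. A Mann-Whitney U test is performed to compare the distribution of income between males and females. Because the test is based on ranks, it gives the same result for income and log(income).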

Step 1: Set up hypotheses and determine level of significance

Null hypothesis: the distribution of income is the same for males and females

Alternative hypothesis: the distribution of income is different for males and females

We will use the usual significance level

α = 0.05

Step 2: Select the appropriate test statistic

The appropriate test statistic for this hypothesis is the Mann-Whitney U test statistic.

Step 3: Set up the decision criteria

We will reject the null hypothesis if the p-value is less than 0.05.

Step 4: Compute the test statistic

The test statistic from R Commander is W = 31130 and the p-value is 0.8639.

Step 5: Conclusion

The p-value is greater than 0.05. Therefore, we cannot reject the null hypothesis: there is insufficient evidence that the distribution of income differs between male and female full-time workers.
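The claim that a rank-based test gives the same answer on income and on log(income) can be checked directly. The sketch below is in Python rather than R Commander, uses hypothetical incomes (not the assignment data), and `mann_whitney_u` is an illustrative helper that counts, for every male-female pair, whether the male value is larger.

```python
import math

def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with x_i > y_j.
    Tied pairs contribute 0.5 each."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical weekly incomes (NOT the assignment data).
male = [820, 1500, 990, 2400, 1100]
female = [760, 1350, 1020, 3100, 880, 1250]

u_raw = mann_whitney_u(male, female)
u_log = mann_whitney_u([math.log(v) for v in male],
                       [math.log(v) for v in female])
print(u_raw, u_log)
```

Because the logarithm is strictly increasing, every pairwise comparison has the same outcome before and after the transformation, so the two statistics are identical and so is any p-value derived from them.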

Question 2

a) The mean and standard deviation of self-reported hours worked are:

Mean = 40.18273 hours

Standard deviation = 6.0627 hours

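b) The confidence interval for the mean can be given as:

$$\bar{x} \pm t_{n-1,\,0.975}\,\frac{s}{\sqrt{n}}$$

Where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size and $t_{n-1,\,0.975}$ is the 97.5th percentile of the t distribution with $n-1$ degrees of freedom.

The calculation is done in R Commander.

95% confidence interval = [39.64895, 40.71651]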

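c) We estimate that full-time workers in Sydney report working 40.18 hours on average (standard deviation 6.06 hours), and we are 95% confident that the population mean lies between 39.65 and 40.72 hours. The confidence interval is narrow, so the mean is estimated precisely.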

d) The margin of error is half the width of the confidence interval. It is determined by the critical t value (which depends on the degrees of freedom), the standard deviation and the sample size.

Calculation is done in R commander.

margin of error = 0.5337821 hours
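As a quick cross-check, the margin of error can be recovered from the reported confidence limits alone. This is a minimal Python sketch of that arithmetic, not the R Commander calculation itself; the small difference in the last decimals comes from the limits being rounded to five places.

```python
# 95% confidence limits for mean hours worked, as reported in part b).
lower, upper = 39.64895, 40.71651

margin_of_error = (upper - lower) / 2  # half the width of the interval
centre = (upper + lower) / 2           # should reproduce the sample mean

print(f"margin of error = {margin_of_error:.5f} hours")
print(f"interval centre = {centre:.5f} hours")
```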

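e) For a given margin of error and confidence level, the required sample size can be calculated as

$$n = \left(\frac{z_{1-\alpha/2}\, s}{e}\right)^2$$

Where,

margin of error (e) = 0.5 hours

standard deviation (s) = 6.0627 hours

confidence level = 0.95, so α = 0.05 and z = 1.96

This gives n = (1.96 × 6.0627 / 0.5)² ≈ 564.8, which is rounded up to the next whole number.

Therefore, a sample of at least 565 full-time workers is required to estimate mean hours worked to within a margin of error of 0.5 hours with 95% confidence, assuming the standard deviation of hours worked is about 6.06 hours.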
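The usual normal-approximation formula for this kind of question, n = (z × s / e)², can be sketched in Python as follows. The function name `sample_size_for_mean` is hypothetical, and the sketch assumes the sample standard deviation is a reasonable stand-in for the population value and that a normal (z) critical value is acceptable.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sd, margin, confidence=0.95):
    """Smallest n for which z * sd / sqrt(n) <= margin (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil((z * sd / margin) ** 2)

# Inputs taken from the answer above: s = 6.0627 hours, e = 0.5 hours.
n_required = sample_size_for_mean(sd=6.0627, margin=0.5)
print(n_required)
```

With these inputs the function returns 565. Using a t critical value instead of z would require a slightly larger sample, because the t value itself depends on n.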

Question 3

a)

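Table 2: Proportion of each sex within each education level (row proportions)

| Education | Sex: male | Sex: female |
|---|---|---|
| postgrad | 0.3833 | 0.6167 |
| bachelor | 0.4437 | 0.5562 |
| certificate | 0.7455 | 0.2544 |
| no tertiary | 0.4200 | 0.5800 |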

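Table 3: Proportion of each education level within each sex (column proportions)

| Education | Sex: male | Sex: female |
|---|---|---|
| postgrad | 0.08646617 | 0.15948276 |
| bachelor | 0.28195489 | 0.40517241 |
| certificate | 0.47368421 | 0.18534483 |
| no tertiary | 0.15789474 | 0.25000000 |

Males are concentrated in the certificate category (47.4% of males compared with 18.5% of females), whereas females are more likely than males to hold a bachelor degree (40.5% compared with 28.2%) or a postgraduate degree (15.9% compared with 8.6%).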
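The difference between row and column proportions can be made concrete with a small Python sketch. The cell counts below are an assumption: they are back-calculated from the reported proportions (for example 0.08646617 = 23/266) because the write-up does not list the counts themselves.

```python
# Cell counts back-calculated from the reported proportions (an assumption:
# the write-up shows proportions only, not the underlying counts).
counts = {
    "postgrad":    {"male": 23,  "female": 37},
    "bachelor":    {"male": 75,  "female": 94},
    "certificate": {"male": 126, "female": 43},
    "no tertiary": {"male": 42,  "female": 58},
}
sexes = ("male", "female")

# Column totals: number of males and females in the sample.
col_totals = {s: sum(row[s] for row in counts.values()) for s in sexes}

# Row proportions: share of each sex within an education level.
row_props = {lvl: {s: row[s] / sum(row.values()) for s in sexes}
             for lvl, row in counts.items()}

# Column proportions: share of each education level within a sex.
col_props = {lvl: {s: row[s] / col_totals[s] for s in sexes}
             for lvl, row in counts.items()}

for lvl in counts:
    print(f"{lvl:12s} row: {row_props[lvl]['male']:.4f} {row_props[lvl]['female']:.4f}"
          f"   col: {col_props[lvl]['male']:.4f} {col_props[lvl]['female']:.4f}")
```

Row proportions answer "of the people with this qualification, what share are male?", while column proportions answer "of the males, what share have this qualification?". Only the second form describes the proportion of men or women holding a given degree.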

b) A chi-squared test is conducted to determine whether there is an association between gender and education level.

Step 1: Set up hypotheses and determine level of significance

Null hypothesis: there is no association between education level and gender

Alternative hypothesis: there is an association between education level and gender

We will use the usual significance level

α = 0.05

Step 2: Select the appropriate test statistic

The appropriate test statistic for this hypothesis is the chi-squared test statistic.

Step 3: Set up the decision criteria

We will reject the null hypothesis if the p-value is less than 0.05.

Step 4: Compute the test statistic

The test statistic from R Commander is χ² = 46.622 (3 degrees of freedom) and the p-value is 4.182e-10.

Step 5: Conclusion

The p-value is 4.182e-10, which is less than 0.05 and provides strong evidence against the null hypothesis, even at the 1 percent significance level. Therefore, we reject the null hypothesis and conclude that there is an association between education level and gender.
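For readers who want to see how the chi-squared statistic arises, the following Python sketch recomputes it by hand. The observed counts are an assumption, back-calculated from the proportions in Tables 2 and 3 rather than taken from the data set, and the p-value line uses a closed-form expression for the upper tail that holds only for 3 degrees of freedom.

```python
import math

# Observed counts (rows: education level; columns: male, female).
# Assumption: back-calculated from the reported proportions.
observed = [
    [23, 37],    # postgrad
    [75, 94],    # bachelor
    [126, 43],   # certificate
    [42, 58],    # no tertiary
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells, where the expected
# count under independence is row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability of a chi-squared variable; this closed form is
# valid only when df == 3.
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

print(f"chi-squared = {chi2:.3f}, df = {df}, p-value = {p_value:.3e}")
```

Under these assumed counts the statistic agrees with the reported value of 46.622 to three decimal places, and the largest contributions come from the certificate row, where the observed split by sex departs most from what independence would predict.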

c)

Table 3 shows that 8.65% of males and 15.95% of females have a postgraduate degree.

d)

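Assume there are n individuals in each group, with $p_1 = 0.10$ and $p_2 = 0.15$.

Average proportion: $\bar{p} = (0.10 + 0.15)/2 = 0.125$

For the minimum sample size per group,

$$n = \frac{\left(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)}\right)^2}{(p_1 - p_2)^2}$$

Put $z_{1-\alpha/2} = 1.96$ (α = 0.05, two-sided) and $z_{1-\beta} = 0.8416$ (80% power), which gives n ≈ 685.6.

Therefore, a minimum of 686 males and 686 females (1372 people in total) is required to detect a difference between population proportions of 0.10 and 0.15 with 80% power at the α = 0.05 significance level, assuming equal group sizes.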
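A sample size calculation for comparing two proportions normally accounts for both the significance level and the power. The Python sketch below implements one common normal-approximation formula for a two-sided test with equal group sizes and no continuity correction; the function name `n_per_group` is hypothetical, and other software may use a slightly different variant.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect p1 vs p2 (two-sided, equal groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2                          # average of the two proportions
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

n_group = n_per_group(0.10, 0.15)
print(f"{n_group} per group, {2 * n_group} in total")
```

With proportions of 0.10 and 0.15 this gives 686 per group. Dropping the power term, or using the sum of the two proportions in place of their average, changes the result substantially, so both inputs are worth stating alongside the answer.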