# POPH90013 Biostatistics: Assignment 1 Answer

Question 1 [Total 14 marks]

Note: This question does not require Stata.

The two figures in this question were originally published in the article by Benhamou AH et al. “Correlation between specific immunoglobulin E levels and the severity of reactions in egg allergic patients”, Pediatr Allergy Immunol (2008), vol. 19, pp. 173–179. The authors used clinical data from 51 oral food challenges to egg, raw or cooked, performed between January 2003 and December 2005. The aim of the study was to determine whether specific immunoglobulin E (IgE) titres were associated with the severity of the reaction during a standardized egg challenge. Serum was obtained for quantification of egg white IgE antibody titres on the day of the food challenge or within the 6 months before it. For illustration purposes, the egg white specific IgE levels are used here after log transformation.

a) [6 marks] Describe how the distribution of log egg white IgE differs between the three groups in Figure 1. Limit comments to comparing the location, spread and maximum/minimum levels of log egg white IgE.

Note: There is no need to report the numerical value of the summary statistics you use; instead you should refer to the name of the summary statistic you are comparing (e.g., median increases/decreases).

b) [2 marks] From Figure 1, which group (no reaction, moderate or severe) has a skewed distribution for log egg white IgE? How can you tell from the box and whisker plot?

c) [4 marks] Using Figure 1 only, draw three new box and whisker plots for egg white IgE, instead of log egg white IgE, in patients with no clinical reaction, moderate reactions, and severe reactions during the food challenge.

From the three new box and whisker plots, which groups (no reaction, moderate or severe) have a skewed distribution for egg white IgE?

Hint: Use the box and whisker plots in Figure 1 to estimate (i.e. make an educated guess based on what you see in the visual display of the data) the median, the 25th and 75th percentiles, and the maximum and minimum values for each plot. Then, to get the egg white IgE, “remove” the log transformation by exponentiating these values (i.e., by using the e^x or exp button on the calculator).
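The back-transformation described in the hint can be sketched in a few lines of Python. This is only an illustration: the five values below are hypothetical placeholders, not estimates read from Figure 1, and the names `log_summary` and `back_transform` are invented for this sketch. It assumes the natural logarithm, as the hint's reference to the exp button suggests.

```python
import math

# Hypothetical five-number summary on the natural-log scale.
# These are placeholder values, NOT readings from Figure 1.
log_summary = {"min": -1.0, "q1": 0.0, "median": 1.0, "q3": 2.0, "max": 3.0}


def back_transform(summary):
    """Exponentiate each log-scale value to return to the original scale."""
    return {name: math.exp(value) for name, value in summary.items()}


raw_summary = back_transform(log_summary)
for name, value in raw_summary.items():
    print(f"{name}: {value:.2f}")
```

Because exponentiation preserves order, the percentiles of the log values map directly to the percentiles of the original values. Equal gaps on the log scale, however, become unequal gaps on the original scale, which is worth keeping in mind when judging the shape of the new box plots.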

d) [2 marks] Based only on Figure 2 below, if you see a patient result with a log egg white IgE value of 1.1 KU/l, is it more likely that the patient was exposed to a raw egg or a cooked egg? Why?

Question 2 [Total 10 marks]

Note: This question does not require Stata.

Table 1 (see below) gives the results of a randomised controlled trial comparing two types of weekly therapeutic venesection (blood removal) for the treatment of severely elevated serum ferritin (SF), a biochemical marker of iron overload disease. The first therapy was whole blood donation (call this therapy “Whole Blood”). The second therapy was “plasmapheresis” where, after the initial removal and separation of whole blood, the blood cells are returned to the body, so this therapy involves the net removal of plasma only (call this therapy “Plasma Only”). The aim of the trial was to determine whether removal of blood cells and plasma (“Whole Blood”), rather than plasma alone (“Plasma Only”), was more effective in reducing SF. Investigators based their statistical analysis of the data from this trial on a two-group (“Whole Blood” and “Plasma Only”) comparison of sample means of SF. A difference in mean SF of 100 ng/L corresponds to a clinically relevant effect.

a) [4 marks] Using the data presented in Table 1, calculate and interpret a 95% confidence interval for the population mean difference in SF between the two therapies.
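As a rough illustration of the arithmetic involved, the sketch below computes a confidence interval for a difference in two means from summary statistics alone. It assumes a large-sample normal approximation with an unpooled standard error; the course may instead expect a t-based interval or a pooled standard deviation. The function name `diff_in_means_ci` and all input numbers are hypothetical and are not the values in Table 1.

```python
import math
from statistics import NormalDist


def diff_in_means_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Normal-approximation CI for mean1 - mean2 using an unpooled SE."""
    diff = mean1 - mean2
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, se, (diff - z * se, diff + z * se)


# Hypothetical summary statistics, NOT the values in Table 1.
diff, se, (lower, upper) = diff_in_means_ci(500.0, 100.0, 50, 450.0, 100.0, 50)
print(f"difference = {diff:.1f}, SE = {se:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

When interpreting an interval like this, compare both limits with the clinically relevant difference stated in the question, not only with zero.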

b) [2 marks] Using the data presented in Table 1, calculate and interpret the P value for the null hypothesis that there is no difference in population mean SF between the two therapies.
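A two-sided P value for the same kind of comparison can be sketched as follows. Again this assumes a normal approximation (a z test) rather than a t test, and the function name `two_sided_p_value` and the input numbers are hypothetical, not taken from Table 1.

```python
import math
from statistics import NormalDist


def two_sided_p_value(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic and two-sided P value for H0: no difference in population means."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # Probability of a statistic at least this far from zero, in either direction.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical summary statistics, NOT the values in Table 1.
z, p = two_sided_p_value(500.0, 100.0, 50, 450.0, 100.0, 50)
print(f"z = {z:.2f}, two-sided P = {p:.4f}")
```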

c) [4 marks] What is your interpretation of the results of part (a) and part (b) above? Specifically, what do: (i) the difference between sample mean SF; (ii) the 95% confidence interval for the difference between the population mean SF; and (iii) the P value tell us about the effectiveness of “Whole Blood” versus “Plasma Only” as a treatment for severely elevated SF?

Question 3 [Total 6 marks]

Note: This question does not require Stata.

A complete version of Table 2 (see below) was originally published in the article by Ghilotti F et al. “Obesity and risk of infections: results from men and women in the Swedish National March Cohort”, International Journal of Epidemiology, Vol. 48, Issue 6, December 2019, Pages 1783–1794.

Assume that height is normally distributed within the sample and within the population for each BMI category, and that the sample mean and sample standard deviation computed in part (a) below are reasonable estimates of the corresponding population parameters.

Question 4 [Total 30 marks]

To answer this question, you will need to use the Stata dataset vitaminD.dta, which can be downloaded from the folder “Assignment 1” in the Assessment area on Canvas.

Vitamin D is critical for regulation of important minerals found in the human body and is obtained from exposure to the sun. Insufficient levels of vitamin D (i.e., vitamin D deficiency) have been linked to increased risk of cardiovascular disease, cancer and asthma, among other conditions. In general, the best way to measure the concentration of vitamin D in the human body is through a blood test. A blood level of vitamin D that is less than 30 nmol/L is considered a serious deficiency.

This dataset on vitamin D levels comes from the book "Regression with Linear Predictors" by Per Kragh Andersen and Lene Theil Skovgaard, published in 2010, and has been modified from the original. The dataset is a subset of a large cross-sectional observational study on vitamin D concentration conducted in Europe. The dataset contains 6 variables and 213 observations. A detailed description of all the variables in the data is below:

The study investigators are interested in addressing the following research question: “To estimate the difference in population mean vitamin D concentration between overweight/obese individuals and individuals in the normal BMI range, separately for (i) individuals who prefer the sun, and (ii) those who avoid the sun.”

a) [1 mark] What type of data is each variable in Table 3?

b) [1 mark] Complete all entries in the table below.

c) [2 marks] How many data values are missing for the variable vitd for each category of bmicat (e.g., of all the individuals that are overweight/obese, what is the number of data values missing for the variable vitd)?

d) [2 marks] What is the mean vitamin D level (vitd) of all female participants? What is the mean vitamin D level (vitd) of all female participants with normal BMI?

e) [2 marks] How many males in the data set have vitamin D levels (vitd) greater than 30 nmol/L? What is their median age?

f) [1 mark] What category of sun exposure is most common in female participants with vitamin D level less than 50 nmol/L?

g) [1 mark] Of all individuals that are in the normal BMI category, what percentage are female and what percentage are male?

h) [1 mark] What percentage of all individuals in the data set are males who avoid the sun? What percentage of all individuals in the data set are males who prefer the sun?

i) [2 marks] Use Stata to produce a histogram of vitamin D levels (vitd) for each of the four combinations of BMI category (bmicat) and sun exposure (sunexp); that is, one histogram for each of the following four strata:

• normal BMI category and avoids the sun;
• normal BMI category and prefers the sun;
• overweight/obese BMI category and avoids the sun;
• overweight/obese BMI category and prefers the sun.

Look up the help file for the histogram command for the relevant options to make the following changes to the graph:

• use the by() option to display the four histograms in a single plot;
• use 10 bars (or bins) per histogram;
• add a plot of the normal distribution to each histogram.

Copy the graph directly into your assignment document by clicking on edit/copy in the Stata graph window. You may also use the “File -> Save As” feature in Stata to save the graph as an image that you can import into Microsoft Word.

j) [2 marks] Use Stata to produce an appropriate graph to display the relationship between vitamin D level (vitd) and sun exposure category (sunexp). Copy the graph directly into your assignment document by clicking on edit/copy in the Stata graph window. You may also use the “File -> Save As” feature in Stata to save the graph as an image that you can import into Microsoft Word. Based only on this graph, which sun exposure category has the higher median vitamin D level (vitd)?

k) [4 marks] Provide a table that summarises the distribution (sample size, range, mean, standard deviation, 25th / 50th / 75th percentiles, standard error of the sample mean) of vitamin D level (vitd) for each of the four strata defined by BMI category and sun exposure. Recall that the strata are:

• normal BMI category and avoids the sun;
• normal BMI category and prefers the sun;
• overweight/obese BMI category and avoids the sun;
• overweight/obese BMI category and prefers the sun.

Ensure that the table is formatted properly (please do not copy and paste directly from Stata output).
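To make the requested quantities concrete, here is a minimal Python sketch that computes them for one stratum. The data values are made up, the helper name `summarise` is hypothetical, and missing values are represented as `None` purely for this illustration. Percentile conventions differ between software packages, so the 25th and 75th percentiles from this sketch may not match those reported by Stata for the same data.

```python
import math
import statistics


def summarise(values):
    """Summary statistics for one stratum, ignoring missing values (None)."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    sd = statistics.stdev(observed)  # sample SD (n - 1 denominator)
    # Quartiles using Python's default method; other software may differ.
    p25, p50, p75 = statistics.quantiles(observed, n=4)
    return {
        "n": n,
        "min": min(observed),
        "max": max(observed),
        "mean": statistics.mean(observed),
        "sd": sd,
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "se": sd / math.sqrt(n),  # standard error of the sample mean
    }


# Made-up vitamin D values for a single hypothetical stratum.
example = summarise([20.0, 30.0, None, 40.0, 50.0, 60.0])
print(example)
```

Note that the sample size used for the standard error is the number of non-missing observations, not the number of rows in the stratum.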

l) [1 mark] Serious vitamin D deficiency may be classified as a serum vitamin D level less than 30 nmol/L. Generate a new binary variable in Stata called vitd_def that classifies each person as having normal (coded as 0) or deficient (coded as 1) vitamin D level.

| vitd (nmol/L) | vitd_def |
| --- | --- |
| ≥ 30 | Normal |
| < 30 | Deficient |

What is the observed proportion of individuals with serious vitamin D deficiency in each of the four strata defined by BMI category and sun exposure category? (e.g., Of all the individuals who have normal BMI and avoid the sun, what proportion have serious vitamin D deficiency?)

m) [4 marks] Assume that you do not have access to the individual participant data for this study but are instead given only the summary statistics that you calculated in part (k). Using the normal distribution, estimate the proportion of individuals with vitamin D deficiency (i.e., vitd < 30 nmol/L) in each of the four “BMI category by sun exposure category” strata. Assume the following:

• vitamin D concentration is normally distributed within each stratum; and
• the sample mean and sample standard deviation calculated in part (k) for each stratum are good estimates of the corresponding population parameters.

How do these proportions based on the normal curve compare with the observed sample proportions obtained in part (l)? Is the assumption of a normal distribution appropriate for each stratum?
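The normal-curve calculation in this part amounts to evaluating a cumulative distribution function at the 30 nmol/L threshold. A minimal sketch follows, using made-up means and standard deviations for two hypothetical strata; the names `proportion_below`, `strata` and `estimates` are invented for this illustration and the numbers are not results from vitaminD.dta.

```python
from statistics import NormalDist


def proportion_below(threshold, mean, sd):
    """P(X < threshold) when X is normal with the given mean and SD."""
    return NormalDist(mu=mean, sigma=sd).cdf(threshold)


# Made-up (mean, SD) pairs in nmol/L for two hypothetical strata.
strata = {
    "hypothetical stratum A": (50.0, 20.0),
    "hypothetical stratum B": (30.0, 10.0),
}

estimates = {
    name: proportion_below(30.0, mean, sd) for name, (mean, sd) in strata.items()
}
for name, prop in estimates.items():
    print(f"{name}: estimated proportion below 30 nmol/L = {prop:.3f}")
```

This is equivalent to standardising the threshold, z = (30 - mean) / SD, and looking up the lower-tail area in a standard normal table.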

n) [2 marks] Calculate and interpret the difference in sample mean vitamin D level between individuals in the normal BMI category and individuals in the overweight/obese BMI category, separately for individuals who prefer the sun and individuals who avoid the sun.

o) [2 marks] Calculate and interpret a 95% confidence interval for the difference in population mean vitamin D level between individuals in the normal BMI category and individuals in the overweight/obese BMI category, separately for individuals who prefer the sun and individuals who avoid the sun.

p) [2 marks] Calculate and interpret a P value for the null hypothesis that there is no difference in population mean vitamin D level between individuals in the normal BMI category and individuals in the overweight/obese BMI category, separately for individuals who prefer the sun and individuals who avoid the sun.