MATH1041 Statistics for Life and Social Science: Data Analysis Assessment Answer
MATH1041 Statistics for Life and Social Science
Term 1, 2020
Data: A data set (in text file format) will be sent to you via email at your official university email address (see page 2 of this document for further details).
Assignment length: No more than SIX single-sided A4 pages including this cover sheet as the first page. Also, please make sure that you include your name and zID somewhere in the assignment.
Obtaining the data via email and reading it into RStudio
The data (that is, your data set) are available in a text file with a name similar to “z1234567.txt” (where z1234567 is replaced by your unique student zID number). This text file has been sent to you via email at your official university email address. PLEASE CHECK YOUR UNIVERSITY EMAILS REGULARLY TO MAKE SURE THAT YOU HAVE OBTAINED YOUR DATA SET. Please email Dr Jakub Stoklosa (email@example.com) if you haven’t received your data set yet.
The first step is to read the data into RStudio. The data format is simple and similar to what you have already worked with in the Introduction labs. Follow the instructions given in Section R1.4 “How to import a text file into RStudio” of the RStudio “How-To-Manual” available on Moodle. Once you’ve imported the data, you are ready to start your analysis!
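As a rough illustration of the import step, the sketch below reads a small whitespace-separated text file with `read.table()`. It is only one possible approach and is not taken from the How-To-Manual. The file contents are made up so that the example runs on its own, and the object name `plants` and the assumption that the first line holds the variable names are illustrative choices that should be checked against your own file.

```r
# Hypothetical stand-in for the emailed file: a few made-up rows written to a
# temporary file so that this sketch runs on its own. With the real data,
# set `path` to the location of your own file (named like "z1234567.txt").
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Height Weight Type Polin",
  "185.2 12.4 0 Wind",
  "201.7 30.1 1 Insect",
  "176.9 8.8 0 Self"
), path)

# Assumption: the first line holds the variable names and the columns are
# separated by whitespace. Adjust `header` or `sep` if your file differs.
plants <- read.table(path, header = TRUE)

str(plants)   # check the variable names and types
head(plants)  # look at the first few rows
```

Checking the result with `str()` straight after importing is a quick way to confirm that all four variables were read in with sensible types.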
Computing assignment format
Here are some more details that may assist you:
• Regarding the overall assignment structure, please answer all questions in the given
order (that is, 1a), b), etc.). You don’t need to re-write the assignment questions
again. Keep your answers brief, clear and concise.
• You are required to type up your entire assignment (rather than scanning and taking
screenshots), including any equations. If you are using Word you should use the
equation editor for any maths notation. You can download Word for free, see:
• Please convert and submit your assignment in pdf.
• We recommend including some working for questions that involve calculations, so that if your final answer is wrong you can still gain marks for your working. But try to keep your solutions brief and concise (since there is a page limit). Depending on what the question is asking, your working could consist of RStudio commands or perhaps the main steps on how you arrived at your answer. You don’t need to add all of your R-code!
• Keeping your results to 2 or 3 decimal places should be fine.
• There is no requirement for font size and line spacing but obviously don’t make things too small.
A group of research ecologists were interested in studying the impacts of climate change on different species of plants that grow in South Australia. Some of these plants are native to Australia while others are considered to be non-native (exotic). To obtain their data, the research team decided to collect a random sample of plants from a national park. Some measurements were then taken on each plant. The random sample of data consists of plant height measurements (measured in centimeters), dry weight measurements (measured in grams), whether the plant was native or non-native to Australia and the pollination mode of the plant (this could be one of four types: wind, water, insect and self-pollination). The text file contains your unique data set of n rows, each consisting of 4 variables: Height which corresponds to the plant height, Weight which corresponds to the dry weight of a plant, Type which corresponds to plant type (native = 0 and exotic = 1), and Polin which corresponds to the pollination mode of the plant (Wind, Water, Insect and Self). Your job is to assist the research team by analysing the data set provided to you.
The Analysis Tasks
The questions you need to answer in your assignment submission are given below. Please make sure your assignment is converted to pdf format.
1. (a) Calculate the sample mean and sample standard deviation of your plant height (Height) measurements.
(b) Produce a normal quantile plot of your sample of plant height measurements (see Section R2.6 “How to produce a normal quantile plot using RStudio”). Include this plot in your submitted assignment, properly labelled.
(c) By referring to the normal quantile plot obtained in Part 1b, briefly discuss whether the plant heights are approximately normally distributed.
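The steps in Question 1 can be sketched roughly as follows. The heights here are simulated purely for illustration (the vector `height` is a hypothetical stand-in for the Height variable in your own data set), and `qqnorm()` with `qqline()` is one common way to draw a normal quantile plot, which may differ from the approach in Section R2.6.

```r
set.seed(1)
# Made-up heights (cm) standing in for the Height column of the real data.
height <- rnorm(50, mean = 190, sd = 15)

mean(height)  # sample mean
sd(height)    # sample standard deviation (uses the n - 1 denominator)

# Normal quantile plot with a reference line; points lying close to the
# line suggest the data are approximately normally distributed.
qqnorm(height,
       main = "Normal quantile plot of plant height",
       ylab = "Plant height (cm)")
qqline(height)
```

When reading the plot, look for systematic curvature or isolated points far from the line rather than small random wiggles.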
2. Let µ be the population mean height (in centimeters) of plants in the national park now (Summer, 2020). The research team decided to compare the current mean plant height with the mean from 20 years ago using plant height data obtained from the same national park. The known mean plant height from 20 years ago was 190 centimeters.
(a) Test the hypothesis that µ has changed from 190 centimeters. You must summarize all steps: state the null (H0) and alternative (Ha) hypotheses relevant to the research objectives stated in this scenario, the value of a suitable test statistic, the sampling distribution for this statistic, a P-value, your summary of significance and your conclusion in plain language.
(b) Some assumptions need to be made for the sampling distribution of the test statistic (as given in Part 2a) to be valid. State these assumptions, and briefly discuss whether these assumptions are satisfied.
(c) Produce a 95% confidence interval for µ, the mean height. For this question you may assume that it is appropriate to use a t-distribution. Make sure you write down all the required steps to calculate this interval.
(d) Does your confidence interval (constructed in Part 2c) include the value 190 centimeters?
(e) Explain whether your confidence interval (constructed in Part 2c) is consistent with your conclusions from the hypothesis test in Part 2a.
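A minimal sketch of the calculations behind Question 2 is given below, assuming a two-sided one-sample t procedure is the suitable choice (the brief only states that a t-distribution may be assumed for the interval in Part 2c). The data are simulated, and the names `height`, `mu0`, `t_stat`, `p_value` and `ci` are hypothetical. The built-in `t.test()` is used only as a cross-check of the step-by-step values.

```r
set.seed(2)
# Made-up heights (cm) standing in for the Height column of the real data.
height <- rnorm(40, mean = 195, sd = 20)

mu0  <- 190                # hypothesised mean: H0: mu = 190, Ha: mu != 190
n    <- length(height)
xbar <- mean(height)
se   <- sd(height) / sqrt(n)   # estimated standard error of the sample mean

# Test statistic and two-sided P-value from the t distribution, n - 1 df
t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval: xbar +/- t* x se
t_star <- qt(0.975, df = n - 1)
ci <- c(xbar - t_star * se, xbar + t_star * se)

# Cross-check against R's built-in one-sample t test
res <- t.test(height, mu = mu0, alternative = "two.sided", conf.level = 0.95)
res
```

For a two-sided test at the 5% level, the 95% interval excludes 190 exactly when the P-value is below 0.05, which is the link that Parts 2d and 2e ask about.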
3. The research team were also interested in studying the relationship between:
• Plant type and height
(a) Produce a comparative boxplot for plant type against height. Include this plot in your submitted assignment, properly labelled.
(b) Describe any differences or similarities in the distribution of plant height for the different types (native or exotic) using your comparative boxplot from Part 3a. Include in your answer comments on shape, location, and spread.
• Plant type and pollination mode
(c) Construct an appropriate numerical summary for the plant type and pollination mode. Briefly describe any differences or similarities in pollination mode between the plant types, based on this numerical summary.
• Plant height and weight
(d) Construct an appropriate graphical summary to visualize the relationship between plant height and weight. Include this plot in your assignment, properly labelled.
(e) Summarize the key features of your plot from Part 3d.
(f) Suggest an appropriate numerical summary to quantify the strength of the linear relationship between plant weight and height. Report and briefly comment on this value.
(g) The research team wanted to predict plant weight from plant height measurements by fitting a linear regression model. Would you recommend the research team do this? Explain briefly. You are not required to carry out any prediction in this question.
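The summaries in Question 3 could be produced along the following lines. This is a sketch on simulated data, not a model answer: the data frame `plants` and the helper column `TypeLabel` are hypothetical, and the choice of a two-way table of counts for Part 3c and the correlation coefficient for Part 3f are assumptions about what counts as "appropriate" here.

```r
set.seed(3)
n <- 60
# Made-up data with the same four variable names as in the brief.
plants <- data.frame(
  Height = rnorm(n, mean = 190, sd = 20),
  Type   = sample(c(0, 1), n, replace = TRUE),
  Polin  = sample(c("Wind", "Water", "Insect", "Self"), n, replace = TRUE)
)
plants$Weight <- exp(0.01 * plants$Height + rnorm(n, sd = 0.4))

# Readable labels for the coding given in the brief (native = 0, exotic = 1)
plants$TypeLabel <- factor(plants$Type, levels = c(0, 1),
                           labels = c("Native", "Exotic"))

# Comparative boxplot of height by plant type
boxplot(Height ~ TypeLabel, data = plants,
        main = "Plant height by plant type",
        xlab = "Plant type", ylab = "Plant height (cm)")

# Two-way table of counts, and proportions within each plant type
counts <- table(plants$TypeLabel, plants$Polin)
counts
prop.table(counts, margin = 1)

# Scatterplot with height on the horizontal axis, since the team wants to
# predict weight from height, and the sample correlation coefficient
plot(plants$Height, plants$Weight,
     main = "Dry weight against plant height",
     xlab = "Plant height (cm)", ylab = "Dry weight (g)")
r <- cor(plants$Height, plants$Weight)
r
```

Row proportions are used in the table because they allow the pollination modes to be compared between native and exotic plants even when the two groups differ in size.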
4. The research team decided to investigate the plant weight (Weight) measurement in more detail.
(a) Produce a histogram for the Weight measurements. Include this histogram in your submitted assignment properly labelled.
(b) Comment on the shape (skewness/symmetry) of your histogram from Part 4a.
(c) A common technique that can be used to remove skewness in data is known as a log-transformation. That is, for each value in your data (denoted by xi), you can log-transform it as yi = log(xi). The function in RStudio that performs a log-transformation on a value is log(). Produce a histogram for the log(Weight) measurements. Include this new histogram in your submitted assignment, properly labelled.
(d) Do you think this log-transformation reduced any skewness? Explain briefly.
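The two histograms in Question 4 can be sketched as below. The weights are simulated from a right-skewed distribution purely for illustration, and `weight` and `log_weight` are hypothetical names standing in for the Weight variable in your own data set. Note that `log()` in R computes the natural logarithm by default.

```r
set.seed(4)
# Made-up right-skewed dry weights (g) standing in for the Weight column.
weight <- rlnorm(80, meanlog = 3, sdlog = 0.6)

# Histogram on the original scale
hist(weight,
     main = "Histogram of plant dry weight",
     xlab = "Dry weight (g)")

# Log-transform each value: y_i = log(x_i), natural logarithm
log_weight <- log(weight)

# Histogram on the log scale
hist(log_weight,
     main = "Histogram of log-transformed dry weight",
     xlab = "log(Dry weight)")

# Comparing mean and median gives a rough numerical check on skewness:
# a mean well above the median suggests a longer right tail.
c(mean = mean(weight), median = median(weight))
c(mean = mean(log_weight), median = median(log_weight))
```

Weights must be strictly positive for the transformation to be defined, which holds for dry weight measured in grams.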