This project leads you through a statistical analysis of used car data. The data for this project was obtained from the car sales website www.carsales.com.au from 2 to 10 January 2019 (inclusive).
Part A covers parts of Topics 1 and 2, Part B parts of Topics 5 to 9.
You will need to work on this project throughout Session 1.
Project Data
The data for this project can be accessed from the MySCU site for MAT10251 in Task 2 - Project under ASSESSMENT.
The data set provided contains 10 randomly chosen samples of size 115.
To obtain your data
(1) Click on the Project Data file. This will download an Excel file.
(2) Select the 7 columns (Year to Price) of data for the sample specified by the last digit of your student ID number.
(3) Copy this into a new Excel file.
There are 10 sample data sets, each of 7 columns (Year to Price).
Your sample number matches the last digit of your SCU student ID number. For example, if your student ID number ends in 2, your sample is Sample 2 and you will be analysing used car data for Toyota Corolla cars for sale in Western Australia in columns Q2:W120.
Project Situation
An online consumer group Oz-Price-Watch regularly analyses used car prices in various Australian states.
As a research assistant for Oz-Price-Watch, you are analysing the data for the car and state specified by your sample. For example, if your student ID number ends in 0, your sample is Sample 0 and you will be analysing prices of used Toyota RAV4 (4 cylinder) cars in New South Wales.
You are required to analyse your sample data in response to the given questions and provide a written answer. You can assume that your written answers are components of a longer report on used car prices.
Project Preparation
You are expected to use Excel when completing the project.
Your written answers presenting your findings and conclusions should be considered as a part of a larger report on used car prices. Each written answer should be a Word document into which your Excel output has been copied.
In addition, your statistical workings for Part B should appear as appendices to your written answer. This should include all necessary steps and appropriate Excel output.
Each part of the project should be submitted as a SINGLE Word document, with appropriate Excel output added.
Notes
Referencing
You are not required to reference.
However, as your written answers are components of a longer report, it may be appropriate to reference. In this case, use any consistent referencing style.
Furthermore, you are not required to use real references. That is, any reference can be fictitious/fake.
You are not required to reference any output or text from Part A that you reuse in Part B.
Project Submission
“Family Name_First Name_Part_A/B/_Campus”
Penalties For
Incorrect Sample
Incorrect Format
MAT10251 STATISTICAL ANALYSIS
PROJECT - PART A
Due Week 4 Tuesday 26 March 2019
If you are a late enrolment in MAT10251, email Nicola Jayne (nicola.jayne@scu.edu.au) with the date you enrolled in MAT10251 for a revised due date.
Value: 10%
Objectives: 1 to 5
Topics: 1 and 2
Purpose:To
Part A Preliminary Analysis of Sample Data
Oz-Price-Watch has asked you for a preliminary analysis of your sample data. Your calculations and conclusions from this analysis may be incorporated in your answer for Part B.
Tasks – Part A Submission
Complete the following
1) Download and save your data.
2) Download the Project Part A coversheets, then name and save this file as
“Family Name_First Name_Part_A_Campus”
3) Enter your Sample Number on page 2 of the Part A coversheets.
4) Statistical Answers: For used cars of the make and model for sale in the state specified by your sample, perform the following:
Price of two and three year old cars
Using Price (7th column of data) explore prices of 2016 and 2017 used cars, by using Excel to:
Note: The required data for 2016 and 2017 used cars is in the first rows of your sample.
Difference in price between cars for sale privately and those for sale by a used car dealer.
Use Price (7th column of data) and Seller (5th column of data), where Private indicates a private sale and Dealer a sale through a used car dealer, for all 115 cars in your sample to explore if there is a difference in price between the samples by using Excel to:
Hint: Sort data on Seller to obtain two samples. That is, price of used cars sold privately and price of used cars sold through a used car dealer.
Relationship between price and age and between price and odometer reading
Explore the relationship between the price of a used car and its age and also the price of a used car and its odometer reading, by using Age (2nd column of data) and Odometer (3rd column of data) as independent variables with Price (7th column of data) as the dependent variable for all 115 cars in your sample, by using Excel to:
5) Written Answer – Preliminary Analysis
Using the instructions given on pages 4 and 5 of the Part A coversheets, introduce your data and the results of your preliminary investigation of the price of used cars, of the make and model in the state specified by your sample.
This should be three to five pages and 400 to 800 words.
Use an appropriate style, without statistical jargon and equations, to clearly communicate your results.
6) Complete Coversheets 1 and 2, then save and submit Part A of the project online using the Project Part A link in Submit Project by the due date, Tuesday 26 March 2019.
Statistical Calculations
Marks will be deducted if:
Examples
Written Answer – Preliminary Analysis
MAT10251 STATISTICAL ANALYSIS
PROJECT – PART B
Due: Week 11 Sunday 19 May 2019
Value: 25%
Objectives: 1 to 5
Topics: 5 to 9
Purpose: To apply your knowledge of statistical inference and regression to answer questions about used cars for sale by analysing the data and communicating the results.
Part B Submission
You should submit a single Word document consisting of:
Part B Preparation
The graphs, plots and interpretations in Part A may be required in the statistical and written answers in Part B. Therefore, check these and make any required corrections.
While the submission date for Part B is Sunday 19 May 2019, you should be working on Part B during Weeks 6 to 11.
It is recommended that you follow this timetable:
Task 1 Part B - Appendices Statistical Inference and Regression and Correlation Tasks (38 marks)
The following statistical tasks should appear as appendices to your written answers. These should include all necessary steps and appropriate Excel output.
These appendices should come after your written answer within your single Word document for Part B.
Statistical Inference
Choose a level of significance for any hypothesis tests and a level of confidence for any confidence intervals. Enter these values on page 2 of the Part B coversheets along with the sample number from Part A.
For used cars of the make and model for sale in the state specified by your sample answer the following questions using appropriate statistical inference and regression techniques.
Question 1 – Topic 5 (5.5 marks)
Since many buyers wish to purchase a two- or three-year-old used car, Oz-Price-Watch has asked you to provide information on the average price of 2016 and 2017 cars of the make and model for sale in the state specified by your sample.
To enable you to answer this use Price (7th column of your data) for 2016 and 2017 cars only, your output from Part A and an appropriate statistical inference technique to:
Estimate the population mean price of two and three year old used cars of the make and model for sale in the state specified by your sample.
Note: The required data for 2016 and 2017 cars is in the first rows of your sample.
Question 2 – Topic 6 (7.5 marks)
Many buyers believe that white cars are safer since they are more visible. Therefore, they wish to purchase a white car. Oz-Price-Watch has asked you to explore if restricting a purchase to white cars will limit a buyer’s choice. Past research by Oz-Price-Watch has shown that buyer choice is limited if a search is restricted to a feature (for example, colour or transmission) that at most 30% of cars for sale have.
To provide a justified answer to the question, use White (6th column of data, where Yes = car for sale is white and No = car for sale is not white) for ALL 115 cars in your sample and an appropriate statistical inference technique to answer the following question:
Are more than 30% of used cars of the make and model for sale in the state specified by your sample white?
Hint: Sort data on White to enable you to easily count the number of white cars in your sample.
Question 3 – Topic 7 (8 marks)
Oz-Price-Watch wishes to know if there is a difference in price between cars for sale privately and those for sale by a used car dealer.
To provide a justified answer to this question, use Price (7th column of data) and Seller (5th column of data) for all 115 cars in your sample, your output from Part A and an appropriate statistical inference technique to answer the following question:
Is there a difference in average price between cars for sale privately and cars for sale by a used car dealer, for the specified make and model in the specified state?
Hint: Sort data on Seller to easily obtain two samples – Prices for private sellers and for used car dealers.
Questions 4 and 5 Simple and Multiple Linear Regression (17 marks)
Oz-Price-Watch asks you how the value of a used car, of the specified make and model, depreciates.
To answer this, develop a simple linear regression model to predict price from age or odometer reading and a multiple linear regression model to predict price from age, odometer reading and transmission type. Then, to provide a justified answer to Oz-Price-Watch, choose and interpret the linear model that best fits your data.
Question 4 Simple Linear Regression Model Topic 8
From your results in Part A choose either Age or Odometer as an independent variable, to predict Price.
To explore the relationship between the age or odometer reading of a used car and its price, use your output from Part A and Age or Odometer (2nd or 3rd column of data) as an independent variable with Price (7th column of data) as the dependent variable, for all 115 cars in your sample, to develop and then explore a simple linear relationship between the two variables by:
Note: You can choose either Age or Odometer as the independent variable in this model.
Question 5 Multiple Linear Regression Model Topic 9
To explore what other factors may have an influence on the value of a used car use your output from Part A and Age, Odometer and Transmission (2nd, 3rd and 4th columns of data) as three independent variables with Price (7th column of data) as the dependent variable for all 115 cars in your sample, to develop and then explore the relationship between these four variables by:
Then determine the best model to predict the price of a used car by:
Notes:
Task 2 - Written Answer – Components of a report (12 marks)
For Questions 1, 2 and 3, and for Questions 4 and 5 combined, present the results of your calculations, with your interpretation and conclusions, as components of a longer report on used car prices.
Use the instructions given on pages 4 and 5 of the Part B coversheets.
This should be 500 to 1100 words and three to seven pages.
It should be submitted as a Word file with Excel output included.
Make sure you:
In particular, for Questions 4 and 5
Q1: A pivot table was used to generate average prices for 2016 and 2017 cars. In addition to the overall average prices, averages were also calculated by seller type and transmission type, as shown in the following table:
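Year | Seller | Transmission | Average Price ($) |
2016 | All | All | 14,919.84 |
2016 | Dealer | All | 14,840.00 |
2016 | Dealer | Automatic | 14,797.14 |
2016 | Dealer | Manual | 14,990.00 |
2016 | Private | All | 14,991.70 |
2016 | Private | Automatic | 15,109.50 |
2016 | Private | Manual | 14,913.17 |
2017 | All | All | 17,091.53 |
2017 | Dealer | All | 17,041.11 |
2017 | Dealer | Automatic | 17,102.41 |
2017 | Dealer | Manual | 15,999.00 |
2017 | Private | All | 17,999.00 |
2017 | Private | Automatic | 17,999.00 |
Grand Total | All | All | 16,005.68 |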
It can be seen that the overall average price is $14,919.84 for 2016 cars and $17,091.53 for 2017 cars.
Q2: From the given data for Mitsubishi cars in Queensland, we counted the number of white and non-white cars:
Colour | Number | % |
Non White | 74 | 64.3% |
White | 41 | 35.7% |
Hence, the research hypothesis is that the proportion of white Mitsubishi cars for sale in Queensland is more than 30%.
We will use a z-test for a proportion.
Hence, from above, p^ = 41/115 = 0.357 (35.7%), where n = 115.
Assuming α = 0.05, we can state the null and alternative hypotheses as:
Null hypothesis H0: p ≤ 0.30
Alternative hypothesis H1: p > 0.30
We will calculate the z-statistic and find the corresponding p-value from the z-table:
z = (0.357 – 0.30) / √(0.30 × 0.70 / 115) = 1.3339
From the z-table, the one-tailed p-value is approximately 0.0911 (the two-tailed value is 0.1822), which is greater than our significance level of α = 0.05. Hence, we are unable to reject the null hypothesis.
We can conclude that, at a significance level of α = 0.05, there is insufficient evidence that white cars account for more than 30% of the Mitsubishi cars for sale in Queensland, even though 35.7% of the cars in the sample are white.
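The z-test for a proportion in Q2 can also be checked outside Excel. The following is a minimal Python sketch, assuming an upper-tailed test of whether the proportion exceeds 0.30; the function name `one_proportion_z_test` is illustrative only. It uses the unrounded proportion 41/115, so its z value comes out slightly lower than a hand calculation based on 0.357.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes: int, n: int, p0: float):
    """Return (z, upper-tail p-value) for H0: p <= p0 against H1: p > p0."""
    p_hat = successes / n
    # The standard error uses the hypothesised proportion p0, not p_hat.
    std_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / std_error
    # Upper-tail area, because H1 claims the proportion exceeds p0.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# 41 white cars out of 115, tested against the 30% threshold.
z, p_value = one_proportion_z_test(successes=41, n=115, p0=0.30)
print(f"z = {z:.4f}, one-tailed p-value = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

The decision rule compares the one-tailed p-value with α = 0.05, the significance level assumed in this answer.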
Q3: From the given data for Mitsubishi cars in Queensland, we found the number, average price and standard deviation of price for cars sold by dealers and by private sellers:
Seller | Number | Average Price | SD |
Dealer | 71 | 12,408.93 | 4,243.81 |
Private | 44 | 11,325.45 | 4,055.51 |
Hence, the research question is whether the average price of Mitsubishi cars sold by dealers in Queensland differs from the average price of those sold by private sellers.
Assuming α = 0.05, we can state the null and alternative hypotheses as:
Null hypothesis H0: µ1 = µ2
Alternative hypothesis H1: µ1 ≠ µ2
where µ1 is the mean price of cars sold by dealers and µ2 is the mean price of cars sold by private sellers.
We will use the two-sample t-test assuming equal variances in Excel. The output is as follows:
From above we can see that the p-value is 0.1787, which is greater than our significance level of α = 0.05. Hence, we are unable to reject the null hypothesis.
We can conclude that, at a significance level of α = 0.05, there is insufficient evidence of a difference between the average price of cars sold by dealers and the average price of cars sold by private sellers for Mitsubishi cars in Queensland. This does not prove that the two averages are equal.
We get same result if we use the test assuming unequal variances:
From above we can see that the p-value is 0.1746, which is greater than our significance level of α = 0.05. Hence, we are again unable to reject the null hypothesis.
The conclusion is therefore unchanged: at a significance level of α = 0.05, there is insufficient evidence of a difference in average price between dealer and private sales of Mitsubishi cars in Queensland.
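The two test statistics in Q3 can be reproduced from the summary table alone. The following is a minimal Python sketch using the reported counts, means and standard deviations. The function names `pooled_t` and `welch_t` are illustrative, and the sketch stops at the t statistic and degrees of freedom because the standard library has no t-distribution; the p-values still come from Excel.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and Welch-Satterthwaite df, assuming unequal variances."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    std_error = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (mean1 - mean2) / std_error, df


# (mean price, standard deviation, sample size) from the summary table.
dealer = (12408.93, 4243.81, 71)
private = (11325.45, 4055.51, 44)

t_pooled, df_pooled = pooled_t(*dealer, *private)
t_welch, df_welch = welch_t(*dealer, *private)
print(f"Equal variances:   t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Unequal variances: t = {t_welch:.3f}, df = {df_welch:.1f}")
```

Both statistics fall well below the two-tailed 5% critical value of roughly 1.98 for these degrees of freedom, which agrees with the failure to reject the null hypothesis under either assumption.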
Q4:
Simple linear regression
The dependent variable (y), car price, was regressed on the independent variable (x), odometer reading, by using Regression in MS Excel:
It can be seen in the regression output that the value of R is 0.83 and R² is 0.69, indicating that the model is a reasonably good fit, with around 69% of the variation in car price explained by the odometer reading. We can interpret the least squares regression equation as follows:
y^ = 17,388.03 – 0.06 x1
where y^ is the predicted car price ($) and x1 is the odometer reading (kms).
We can see that there is an inverse relationship between the variables: as the odometer reading increases, the car price decreases, and vice versa. The above equation has a constant of 17,388.03, which is the Y-intercept, that is, the predicted price when the odometer reading is zero.
The p-values indicate the statistical significance of each variable. We can see that the p-value for odometer reading is 0.0000 (to four decimal places), which is less than α = 0.05. Hence, this variable is statistically significant.
Further, from the confidence interval data, we can be 95% confident that for each additional kilometre on the odometer, the car price decreases by between $0.05 and $0.07.
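A per-kilometre slope is hard to read in dollar terms, so it can help to restate it per 10,000 km. The following is a minimal Python sketch of that rescaling, using the rounded slope and interval bounds quoted above. The helper name `rescale_per_km` and the choice of 10,000 km are assumptions made for illustration.

```python
def rescale_per_km(value_per_km: float, km: int = 10_000) -> float:
    """Convert a dollars-per-km coefficient into dollars per `km` kilometres."""
    return round(value_per_km * km, 2)


# Rounded slope and 95% interval bounds for odometer reading (dollars per km).
slope, lower, upper = -0.06, -0.07, -0.05

per_10k = [rescale_per_km(v) for v in (slope, lower, upper)]
print(per_10k)  # [-600.0, -700.0, -500.0]
```

On these rounded figures, every extra 10,000 km is associated with a price about $600 lower, with a 95% interval of $500 to $700.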
Q5:
Multiple linear regression
The dependent variable (y), car price, was regressed on the independent variables (x), age, odometer reading (kms) and transmission, by using Regression in MS Excel:
In the above, we had to convert the Transmission data to numeric data using the following coding: Automatic was coded as ‘1’ and Manual was coded as ‘0’.
It can be seen in the regression output that the value of R is 0.92 and R² is 0.84, indicating that the model is a very good fit, with around 84% of the variation in car price explained by age, odometer reading (kms) and transmission. We can interpret the least squares regression equation as follows:
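y^ = 17,163.31 – 625.68 x1 – 0.02 x2 + 1,301.91 x3
where y^ is the predicted car price ($), x1 is age (years), x2 is odometer reading (kms) and x3 is transmission (1 = Automatic, 0 = Manual).
The above equation has a constant of 17,163.31, which is the Y-intercept.
We can see that there is an inverse relationship between price and both age and odometer reading: as age or odometer reading increases, car price decreases, and vice versa. The transmission coefficient is positive, consistent with its 95% confidence interval of 635.68 to 1,968.14, so an automatic car is predicted to cost more than a manual car of the same age and odometer reading.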
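To show how the fitted equation and the 0/1 transmission coding work together, the following is a minimal Python sketch that predicts a price from the rounded coefficients. The function name `predict_price` and the example car (5 years old, 80,000 km) are hypothetical. The transmission coefficient is taken as positive, in line with its reported 95% confidence interval of 635.68 to 1,968.14.

```python
# Coding used for the regression: Automatic = 1, Manual = 0.
TRANSMISSION_CODES = {"Automatic": 1, "Manual": 0}

# Rounded coefficients from the multiple regression output.
INTERCEPT = 17163.31
B_AGE = -625.68          # dollars per year of age
B_ODOMETER = -0.02       # dollars per kilometre
B_TRANSMISSION = 1301.91  # assumed positive, per the reported interval


def predict_price(age_years: float, odometer_km: float, transmission: str) -> float:
    """Predicted price in dollars; raises KeyError for an unknown transmission."""
    x3 = TRANSMISSION_CODES[transmission]
    price = (INTERCEPT + B_AGE * age_years
             + B_ODOMETER * odometer_km + B_TRANSMISSION * x3)
    return round(price, 2)


# Hypothetical car: 5 years old with 80,000 km on the odometer.
auto_price = predict_price(5, 80_000, "Automatic")
manual_price = predict_price(5, 80_000, "Manual")
print(auto_price, manual_price)  # 13736.82 12434.91
```

The gap between the two predictions equals the transmission coefficient, which is how a 0/1 variable should be read: it shifts the predicted price by a fixed amount for automatic cars.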
The p-values indicate statistical significance of various variables. We can see the p-values for various variables as follows:
Variable | p-value |
x1 Age | 0.0000 |
x2 Odometer (kms) | 0.0002 |
x3 Transmission | 0.0002 |
In each case, the p-value is less than α = 0.05. Hence, each of these variables is statistically significant.
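Further, from the 95% confidence interval data:
Variable | Lower 95% | Upper 95% |
x1 Age | -761.14 | -490.23 |
x2 Odometer (kms) | -0.03 | -0.01 |
x3 Transmission | 635.68 | 1,968.14 |
We can be 95% confident that, holding the other variables constant:
Each additional year of age reduces the car price by between $490.23 and $761.14.
Each additional kilometre on the odometer reduces the car price by between $0.01 and $0.03.
An automatic car is priced between $635.68 and $1,968.14 higher than a manual car.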
Comparison
In both cases, all the variables were statistically significant. In both cases, the p-value for the regression as a whole is 0.000, which is less than the significance level of α = 0.05. Hence, both models are statistically significant.
The two models can be compared using R², where a higher value indicates a better fit. The simple linear regression model had an R² value of 0.69, while the multiple linear regression model had an R² value of 0.84.
Hence, we can say that the multiple linear regression model is a better fit than the simple linear regression model: its higher R² indicates that about 84% of the variation in car price can be explained by the three selected variables, namely age, odometer reading (kms) and transmission type.
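One caution applies to this comparison: R² never decreases when predictors are added, so models with different numbers of predictors are often compared using adjusted R², which Excel also reports. The following is a minimal Python sketch, assuming n = 115 and the rounded R² values quoted above; the function name `adjusted_r2` is illustrative only.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 115  # cars in the sample
simple_adj = adjusted_r2(0.69, n, k=1)    # odometer only
multiple_adj = adjusted_r2(0.84, n, k=3)  # age, odometer, transmission
print(f"Simple:   {simple_adj:.4f}")
print(f"Multiple: {multiple_adj:.4f}")
```

On these rounded inputs the multiple regression model still has the higher adjusted value, so the penalty for the two extra predictors does not change the ranking.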