BUS708 Statistics and Data Analysis
Statistical Modelling Assignment
1 OVERVIEW OF THE ASSIGNMENT
This assignment tests your skills in collecting and analysing data to answer a specific business problem. It also gives you the opportunity to apply the theories you have learned in this course, such as finding numerical summaries, displaying data with appropriate graphs and using statistical inference to solve business problems, including constructing hypotheses, testing them and interpreting the findings. You may have to use two datasets. One dataset will be sent to you individually via KOI student email, and you need to find or collect the other.
Suppose you are working for an agency that analyses NSW transport system data to make recommendations for improving the public transport system. You will be given a series of research questions. Use the knowledge you gain from this course to answer these questions by displaying appropriate outputs from Excel, StatKey or Wolfram Alpha. Use these answers to write an executive summary that could serve as a valuable recommendation to Transport NSW.
2 TASK DESCRIPTION: WRITTEN REPORT
There are two datasets involved in this assignment: Dataset 1 and Dataset 2, detailed below.
Dataset 1: You will receive an email that contains a dataset specifically allocated to you. This dataset is a subset of the Opal Tap On and Tap Off Location - 8th to 14th August 2016 individual sample file, provided by Transport for NSW Open Data, and has been edited to include only a subset of the cases and variables. The original dataset can be obtained from https://opendata.transport.nsw.gov.au/dataset/opal-tap-on-and-tap-off and is licensed under Creative Commons Attribution 3.0 Australia. The data dictionary for the edited dataset is given in the following table.
Variable | Description | Values |
mode | Type of public transport | Bus, Train, Ferry and Light Rail |
date | Date of the tap on/off | Date/month/year |
tap | Whether the record is a tap on or a tap off | On and Off |
loc | Location of the stop: postcode for bus, station name for other modes | Postcodes and names of the stations |
count | Total number of taps on or off at the given location on the given date | Number |
Dataset 2: Collect data (e.g. via a survey) that will answer the research question given in section 3. There is no requirement on the number of variables, sampling method or sample size, but you need to justify your approach in Section 1 (see below).
Both datasets should be saved in an Excel file (one file, separate worksheets). All data processing should be performed in Excel or Statkey (http://www.lock5stat.com/StatKey).
Prepare a report in a document file (.doc or .docx) which includes all relevant tables and figures, using the following structure:
Section 1: Introduction
Section 2: Analysis of single variable in Dataset 1
NSW people during 8th to 14th of August 2016?”, provide a suitable numerical summary and graphical display for the variable mode of Dataset 1. Give a detailed comment to answer the research question.
Section 3: Analysis of two variables in Dataset 1
The NSW Government needs to decide whether to build an underground railway line to Central from Parramatta, Bankstown or Gosford. To prepare a recommendation for this;
Section 4: Collect and analyse Dataset 2
You are interested in finding out whether there is a difference between genders in their preferred transport mode (Bus, Train, Ferry and Light Rail). Considering an appropriate number of cases and variables, give a proper graphical display and use it to write a comment.
Section 5: Discussion & Conclusion
Write an executive summary by combining all your findings in the previous sections, which must be a valuable recommendation for NSW Transport. Give a suggestion for further research.
3 TASK DESCRIPTION: PRESENTATION/INTERVIEW
A presentation/interview for the assignment is scheduled in Week 11, in your allocated tutorial.
You do NOT need to prepare presentation material (e.g. PowerPoint slides); instead, you will be asked to demonstrate and/or explain how you summarised the data and how you performed the analysis. You may be asked to reproduce what you have made in your written report (e.g. generate a chart or numerical summary using Excel or StatKey).
This assignment discusses the provided dataset, which gives insight into the various modes of transportation used in NSW and the patterns therein. The data will be analysed to understand the distribution of users among the modes of transportation, namely bus, train, ferry and light rail. The data is also broken down by location and station.
A second dataset was selected from the train statistics for NSW published on the government’s transport website (Bureau of Transport Statistics, 2014). It gives insight into journeys to work and the share of those journeys made by train. The data is divided by centre and provides total trips as well as the rail share. These figures are given both including and excluding ‘walk only’ trips, to give a better picture of transport use as well as journeys made on foot.
The citation for the article is: Bureau of Transport Statistics (2014) ‘Train Statistics 2014: Everything you need to know about Sydney Trains and NSW TrainLink’ [online] Accessed from: https://www.transport.nsw.gov.au/sites/default/files/media/documents/2017/Train%20Statistics%202014.pdf
Dataset 1 is secondary data: a subset of the main data available on the government website. The subset covers the period from 8th to 14th August 2016.
The data gives passenger counts by transport mode (train, bus, light rail and ferry), broken down by location. It also records whether each count is a tap on or a tap off at that location.
Dataset 2 is also secondary data, derived from the 2011 Australian census and taken from the train statistics for NSW described above (Bureau of Transport Statistics, 2014).
For this purpose, a pivot table was created with the modes of transportation as rows and the sum of the provided count as values.
This was used to create a pie chart as follows:
It is clear that train is the most widely used mode of transportation, accounting for as much as 60% of the total count in the provided subset.
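The pivot-table step can also be sketched outside Excel. The following is a minimal illustration only: the rows are hypothetical values in the shape of Dataset 1 (they are not taken from the allocated sample), and the function name `share_by_mode` is an assumption introduced for this example.

```python
from collections import defaultdict

# Hypothetical rows in the shape of Dataset 1: (mode, tap, loc, count).
# Illustrative values only; they are NOT from the allocated sample file.
rows = [
    ("Train", "on", "Central Station", 400),
    ("Train", "off", "Town Hall Station", 200),
    ("Bus", "on", "2000", 250),
    ("Bus", "off", "2150", 100),
    ("Ferry", "on", "Circular Quay", 30),
    ("Light Rail", "on", "Central Station", 20),
]


def share_by_mode(records):
    """Sum `count` per mode and return {mode: (total, percent_of_all)}."""
    totals = defaultdict(int)
    for mode, _tap, _loc, count in records:
        totals[mode] += count
    grand_total = sum(totals.values())
    return {
        mode: (total, round(100 * total / grand_total, 1))
        for mode, total in totals.items()
    }


summary = share_by_mode(rows)
# Print modes from most to least used, as a pivot table sorted by total would.
for mode, (total, pct) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{mode:<10} {total:>6} {pct:>5}%")
```

The percentages produced this way are the same quantities a pie chart of the pivot totals displays.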
Section 2b: Hypothesis Testing
The breakdown by mode of transportation is as follows:
It can be seen that ferry and light rail each account for less than 3% and hence cannot be the mode chosen by more than 50% of public transport users in NSW. The remaining two modes, bus (36.6%) and train (59.7%), account for the majority of users.
For this, the counts for bus and train were separated into two columns. The mean and variance of each were calculated using Excel formulae. A two-sample z-test was then run, with the following output:
From the above, the one-tailed p-value is much less than the alpha of 0.05. Hence, we reject the null hypothesis that the mean counts for bus and train are equal. In other words, there is a significant difference between the mean counts for bus and train, which is consistent with trains accounting for more than 50% of users during the period.
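The calculation behind a two-sample z-test for means can be sketched as follows. This is an illustration under stated assumptions, not the Excel output itself: the sample values are hypothetical, the function name `two_sample_z` is introduced here, and sample variances stand in for the known variances that the z-test formally requires (reasonable only for large samples).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(sample_a, sample_b, hypothesized_diff=0.0):
    """Return (z, one_tailed_p, two_tailed_p) for H0: mean_a - mean_b = hypothesized_diff.

    Assumption: sample variances are used in place of known population variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    z = (mean(sample_a) - mean(sample_b) - hypothesized_diff) / standard_error
    # Probability of a standard normal value beyond |z| in one tail.
    upper_tail = 1.0 - NormalDist().cdf(abs(z))
    return z, upper_tail, 2.0 * upper_tail


# Hypothetical counts, far smaller than a real sample; for illustration only.
train_counts = [10.0, 12.0, 14.0]
bus_counts = [4.0, 6.0, 8.0]

z, p_one, p_two = two_sample_z(train_counts, bus_counts)
alpha = 0.05
print(f"z = {z:.4f}, one-tailed p = {p_one:.6f}")
print("Reject H0" if p_one < alpha else "Fail to reject H0")
```

Note that this test compares mean counts per record; it does not by itself test whether the train share of all users exceeds 50%.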
For this purpose, a pivot table was created with modes and locations. A filter was then applied for ‘train’, and the figures for the three stations were extracted from the pivot using VLOOKUP:
This data was used to create a bar chart as follows:
From the bar chart above, it is clear that Parramatta station has the highest number of passengers of the three candidates at 1,578 (2.6% of total train users). Central station is already busy at 3,997 passengers. Hence, the most suitable underground line appears to be Parramatta to Central Station, so as to ease the heavy footfall.
For this purpose, a single-factor ANOVA test was conducted in MS Excel. The ‘count’ data for ‘on’ and ‘off’ were separated into two columns (trains only) and the test was run at an alpha of 0.05. The hypotheses are:
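The filter-and-lookup step for the three candidate stations can be sketched as below. The station totals are hypothetical placeholders, the exact station labels are assumed, and `station_shares` is a name introduced for this example; the real figures come from the pivot table.

```python
# Hypothetical train-only totals per station (tap on and tap off combined).
# Illustrative values only; the station labels are assumed spellings.
train_totals = {
    "Town Hall Station": 900,
    "Central Station": 600,
    "Parramatta Station": 240,
    "Gosford Station": 80,
    "Bankstown Station": 70,
    "Other stations": 8110,
}
candidates = ["Parramatta Station", "Bankstown Station", "Gosford Station"]


def station_shares(totals, stations):
    """Look up each station's total and its percentage of all train taps."""
    all_train = sum(totals.values())
    return {
        name: (totals.get(name, 0), round(100 * totals.get(name, 0) / all_train, 2))
        for name in stations
    }


shares = station_shares(train_totals, candidates)
busiest = max(shares, key=lambda name: shares[name][0])

for name, (total, pct) in shares.items():
    print(f"{name:<20} {total:>6} {pct:>6}%")
print("Busiest candidate:", busiest)
```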
H0: μon = μoff
H1: μon ≠ μoff
The result is as follows:
From the above, we can see that F (2.47) < F crit (3.86). Hence, we fail to reject the null hypothesis and conclude that there is no significant difference between the means of ‘tap on’ and ‘tap off’.
We saw that, of the three stations considered (Bankstown, Gosford and Parramatta), the heaviest footfall is at Parramatta. Central Station is busier still. We also saw that the means of tap on and tap off do not differ significantly at the alpha level of 0.05.
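The F statistic reported by a single-factor ANOVA can be reproduced by hand. The sketch below uses hypothetical tap-on and tap-off counts and a function name, `one_way_anova`, introduced for illustration. The critical value is not computed here: it is assumed to be read from an F table for the relevant degrees of freedom (7.71 for 1 and 4 degrees of freedom at alpha 0.05 in this toy case, which differs from the 3.86 that applies to the report's own sample size).

```python
from statistics import mean


def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical counts for illustration only.
tap_on = [1.0, 2.0, 3.0]
tap_off = [2.0, 3.0, 4.0]

f_stat, df_between, df_within = one_way_anova(tap_on, tap_off)
f_crit = 7.71  # assumed table value for F(1, 4) at alpha = 0.05
reject_h0 = f_stat > f_crit
print(f"F = {f_stat:.2f} with df = ({df_between}, {df_within})")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With only two groups, this test is equivalent to a two-sample t-test with pooled variance, since F equals the square of the t statistic.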
Hence it is recommended to build an underground railway line from Parramatta Station to Central Station so as to ease the heavy footfall.
In the above report, it was seen that during the period from 8th to 14th August 2016, the types of public transport used in NSW were bus, train, light rail and ferry. A very small percentage of passengers used ferry (2%) and light rail (1%). The majority of people relied on buses (37%) and trains (60%).
It was seen that the top 25 train stations accounted for almost 69% of total train users, whereas the top 25 bus locations accounted for almost 53% of total bus users during the period. The data is presented in the graphs below (count on x-axis):
This indicates that traffic can be managed much better by focusing on the top 20 or top 25 locations instead of all of them.
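The share of users covered by the busiest locations can be computed as sketched below. The location totals are hypothetical and `top_n_share` is a name introduced for this example; with the real data, `n` would be 20 or 25.

```python
def top_n_share(totals, n):
    """Percentage of the overall count contributed by the n largest locations."""
    ranked = sorted(totals.values(), reverse=True)
    return round(100 * sum(ranked[:n]) / sum(ranked), 1)


# Hypothetical per-location totals; illustrative values only.
location_totals = {
    "Location A": 500,
    "Location B": 300,
    "Location C": 100,
    "Location D": 50,
    "Location E": 30,
    "Location F": 20,
}

top2 = top_n_share(location_totals, 2)
top3 = top_n_share(location_totals, 3)
print(f"Top 2 locations: {top2}% of users")
print(f"Top 3 locations: {top3}% of users")
```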
It was found that there is a significant difference between the mean counts for bus and train, indicating that train usage is much higher than bus usage and may account for more than 50% of transport users.
Hence, the focus is on improving traffic management at train stations. Central Station (6.57%) is the second busiest station (after Town Hall Station at 10.09% of train users) according to the provided data. Analysis was done to assess the need for an underground railway line from Central Station to one of the three selected stations (corresponding percentage of train users in parentheses): Parramatta (2.59%), Bankstown (0.80%) and Gosford (0.86%).
Looking at the data, Parramatta is the busiest of the three, with the highest footfall. Hence, an underground line from Parramatta to Central Station would be the most beneficial.
The data was also categorised by tap on and tap off, but hypothesis testing showed no significant difference between the means of tap on and tap off. Hence, it seems that this categorisation can be ignored for the purpose of analysis without any significant impact on the findings.