# Dataset Analysis of Modes of Transportation Used in NSW and Their Patterns

BUS708 Statistics and Data Analysis

Statistical Modelling Assignment

1  OVERVIEW OF THE ASSIGNMENT

This assignment will test your skills in collecting and analysing data to answer a specific business problem. It also gives you the opportunity to apply the theories you have learned in this course, such as finding numerical summaries, displaying data with appropriate graphs and using statistical inference to solve business problems, including constructing hypotheses, testing them and interpreting the findings. You may have to use two datasets. One dataset will be sent to you individually via KOI student email, and you need to find or collect another dataset.

Suppose you are working for an agency that analyses NSW transport system data to make recommendations for improving the public transport system. You will be given a series of research questions. Use the knowledge you gain from this course to answer these questions by displaying appropriate outputs from Excel, StatKey or Wolfram Alpha. Use these answers to write an executive summary that could serve as a valuable recommendation to Transport NSW.

There are two datasets involved in this assignment: Dataset 1 and Dataset 2, detailed below.

Dataset 1: You will receive an email that contains a dataset specifically allocated to you. This dataset is a subset of the "Opal Tap on and Tap Off Location - 8th to 14th August 2016" individual sample file provided by Transport for NSW Open Data, and has been edited to include only a subset of the cases and variables. The original dataset can be obtained from https://opendata.transport.nsw.gov.au/dataset/opal-tap-on-and-tap-off and is under the Creative Commons Attribution 3.0 Australia licence. The data dictionary of the edited dataset is given in the following table.

| Variable | Description | Values |
|---|---|---|
| mode | Type of the public transport | Bus, Train, Ferry and Light Rail |
| date | Date of the tap on/off held | Date/month/year |
| tap | It is a tap on or off | On and Off |
| loc | Locations of stops. For bus postcodes and others name of the stations | Postcodes and names of the stations |
| count | Total number tap on or off on the certain location and the certain date | Number |

Dataset 2: Collect data (e.g. via a survey) that will answer the research question given in section 3. There is no requirement about the number of variables, sampling method or sample size, but you need to justify your approach in Section 1 (see below).

Both datasets should be saved in one Excel file, in separate worksheets. All data processing should be performed in Excel or StatKey (http://www.lock5stat.com/StatKey).

Prepare a report in a document file (.doc or .docx) which includes all relevant tables and figures, using the following structure:

Section 1: Introduction

1. Give a brief introduction to the assignment, search for a related article and write a one-paragraph summary of it that supports your assignment. You need to give the full citation of the article.
2. Dataset 1: Give a short description of this dataset. Is this primary or secondary data? What types of variables are involved? Explain briefly what the possible cases used in this study are.
3. Dataset 2: Explain how you collected the data and discuss its limitations (e.g. whether your sample is biased). Is this primary or secondary data? What is/are the type(s) of variable(s) involved? Give a description of the cases you consider for this dataset.

Section 2: Analysis of single variable in Dataset 1

1. To answer the research question “Which type of public transport was most used by the NSW people during 8th to 14th of August 2016?”, provide a suitable numerical summary and graphical display for the variable mode of Dataset 1. Give a detailed comment to answer the research question.
2. Now, to answer the research question “Do more than 50% of public transport users in NSW use the particular mode of transport found in Part a?”, set up appropriate hypotheses, perform a hypothesis test and answer the research question by writing the conclusion of the test.

Section 3: Analysis of two variables in Dataset 1

The NSW Government needs to decide whether to build an underground railway line from either Parramatta, Bankstown or Gosford to Central. To prepare a recommendation for this:

1. Give a numerical summary and an appropriate graphical display for the variable location, by only considering those three stations, and for the variable count, by considering the data for trains only.
2. Perform a suitable hypothesis test at a 5% level of significance to test whether there is a difference between the mean counts of taps on and taps off.
3. Use the conclusion of the test in part b and the outputs in part a to write a recommendation to the NSW Government.

Section 4: Collect and analyse Dataset 2

You are interested in finding whether there is a difference in preference between genders in terms of their transport mode (Bus, Train, Ferry and Light Rail). By considering an appropriate number of cases and variables, give a proper graphical display and use it to write your comments.

Section 5: Discussion & Conclusion

Write an executive summary that combines all your findings from the previous sections and forms a valuable recommendation for NSW Transport. Give a suggestion for further research.

A presentation/interview for the assignment is scheduled in Week 11, in your allocated tutorial.

You do NOT need to prepare presentation material (e.g. PowerPoint slides). Instead, you will be asked to demonstrate and/or explain how you summarised the data and how you performed the analysis. You may be asked to reproduce what you have made in your written report (e.g. generate a chart or numerical summary using Excel or StatKey).

# Section 1a: Introduction

This assignment discusses the provided dataset, which gives insight into the various modes of transportation used in NSW and the patterns therein. The data is analysed to understand the distribution of users among the modes of transportation, namely bus, train, ferry and light rail. The data also provides a breakdown by location and station.

The second dataset is based on the train statistics for NSW available on the government’s transportation website (Bureau of Transport Statistics, 2014). The data gives insight into journeys to work and the share of those journeys made by train. It is divided by centre and provides total trips as well as the railway share. The figures are given both including and excluding ‘walk only’ trips, which gives a better idea of the use of transportation as well as of journeys covered on foot.

The citation for the article is: Bureau of Transport Statistics (2014) ‘Train Statistics 2014: Everything you need to know about Sydney Trains and NSW TrainLink’ [online] Accessed from: https://www.transport.nsw.gov.au/sites/default/files/media/documents/2017/Train%20Statistics%202014.pdf

# Section 1b: Dataset 1

Dataset 1 is secondary data: it is a subset of the main data that Transport for NSW publishes on its open data website, rather than data collected first-hand for this report. The subset covers the period from 8th August to 14th August 2016.

The data gives the count of taps by mode of transportation (train, bus, light rail and ferry) and by location. It also records whether each count refers to a tap on or a tap off at that location.

# Section 1c: Dataset 2

Dataset 2 is also secondary data, as it was compiled from the Australian census 2011 rather than collected first-hand. The data selected is based on the train statistics for NSW available on the government’s transportation website (Bureau of Transport Statistics, 2014). The data gives insight into journeys to work and the share of those journeys made by train. It is divided by centre and provides total trips as well as the railway share. The figures are given both including and excluding ‘walk only’ trips, which gives a better idea of the use of transportation as well as of journeys covered on foot.

# Section 2a: Type of Mode

For this purpose, a pivot table was created that sums the provided count for each mode of transportation.

This was used to create a pie chart as follows:

It is clear that train is the most widely used mode of transportation, accounting for about 60% of the total count in the provided subset.
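The pivot-table step can also be sketched outside Excel. The Python snippet below is only an illustration, not the workbook used for this report: the rows are hypothetical values laid out like Dataset 1, and the function name `share_by_mode` is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical rows in the shape of Dataset 1: (mode, date, tap, loc, count).
# These values are made up for illustration and are NOT the assignment data.
rows = [
    ("Train", "08/08/2016", "on", "Town Hall Station", 120),
    ("Train", "08/08/2016", "off", "Central Station", 80),
    ("Bus", "08/08/2016", "on", "2000", 90),
    ("Ferry", "09/08/2016", "on", "Circular Quay", 6),
    ("Light Rail", "09/08/2016", "off", "Central Light Rail", 4),
]


def share_by_mode(rows):
    """Sum `count` per mode and return each mode's share of the grand total."""
    totals = defaultdict(int)
    for mode, _date, _tap, _loc, count in rows:
        totals[mode] += count
    grand_total = sum(totals.values())
    return {mode: total / grand_total for mode, total in totals.items()}


shares = share_by_mode(rows)
# Print the modes from most used to least used.
for mode, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mode}: {share:.1%}")
```

The resulting shares are the same quantities that a pie chart of the pivot table displays.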

# Section 2b: Hypothesis Testing

The breakdown of the various modes of transportation is as follows:

It can be seen that ferry and light rail account for less than 3% each and hence cannot be the mode chosen by more than 50% of public transport users in NSW. The remaining two modes, bus (36.6%) and train (59.7%), account for the majority of users.

To test this, the counts for bus and train were separated into two columns. The mean and variance of each were calculated using Excel formulas. A z-test for two samples was then run, and the output is as follows:

From the output above, the one-tailed p-value is much less than the significance level of 0.05. Hence, we reject the null hypothesis that the mean counts for bus and train are equal. In other words, there is a significant difference between the mean counts for bus and train. Together with the 59.7% share reported above, this suggests that train accounts for more than 50% of users during the period.
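Because the research question concerns a proportion, a one-proportion z-test of H0: p = 0.5 against H1: p > 0.5 is a more direct alternative to the two-sample test above. The sketch below is not the Excel analysis used in this report. It assumes each tap can be treated as an independent observation, and the sample size of 1,000 is hypothetical; only the 59.7% share comes from the report.

```python
import math


def one_proportion_z_test(successes, n, p0=0.5):
    """Right-tailed z-test of H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    # Standard error of the sample proportion under the null hypothesis.
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_hat, z, p_value


# Hypothetical sample: 597 train taps out of 1,000 taps (59.7%).
# The sample size is an assumption made purely for illustration.
p_hat, z, p_value = one_proportion_z_test(597, 1000)
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, reject H0 at 5%: {p_value < 0.05}")
```

With the real data, `successes` would be the total train count and `n` the total count over all modes.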

# Section 3a: Recommendation for Underground Line

For this purpose, a pivot table was created with the various modes and locations, and a filter was applied for ‘train’. The data for the three stations was then drawn out of the pivot table using VLOOKUP:

This data was used to create a bar chart as follows:

From the bar chart above, it is clear that Parramatta station has the highest number of passengers of the three stations, at 1,578 (or 2.6% of total train users). Central station is already busy, at 3,997 passengers. Hence, the most suitable underground line appears to be Parramatta to Central Station, so as to ease the heavy footfall.
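The filter-and-lookup step can be sketched in Python as follows. This is only an illustration of the logic: the rows are hypothetical, and the station labels are assumptions, since the exact `loc` strings in Dataset 1 may differ.

```python
from collections import Counter

# Station labels are assumptions; the actual `loc` strings may differ.
STATIONS = {"Parramatta Station", "Bankstown Station", "Gosford Station"}

# Hypothetical rows (mode, tap, loc, count); values are made up.
rows = [
    ("Train", "on", "Parramatta Station", 40),
    ("Train", "off", "Parramatta Station", 35),
    ("Train", "on", "Bankstown Station", 12),
    ("Train", "off", "Gosford Station", 9),
    ("Train", "on", "Town Hall Station", 90),
    ("Bus", "on", "2150", 25),
]


def train_totals(rows, stations):
    """Total count per station, keeping train rows for the given stations only."""
    totals = Counter()
    for mode, _tap, loc, count in rows:
        if mode == "Train" and loc in stations:
            totals[loc] += count
    return totals


totals = train_totals(rows, STATIONS)
busiest, busiest_count = totals.most_common(1)[0]
print(f"Busiest of the three: {busiest} ({busiest_count})")
```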

# Section 3b: Hypothesis Testing for Tap on/off

For this purpose, a single-factor ANOVA test was conducted in MS Excel. The ‘count’ data for ‘on’ and ‘off’ was separated into two columns (for trains only) and the test was run at a significance level of 0.05. The hypotheses are:

H0: μ_on = μ_off

H1: μ_on ≠ μ_off

The result is as follows:

From the output above, the value of F (2.47) is less than F crit (3.86). Hence, we fail to reject the null hypothesis. In conclusion, there is no significant difference between the mean counts of ‘tap on’ and ‘tap off’.
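The decision rule above (reject H0 only when F exceeds F crit) can be illustrated with a short Python sketch of the single-factor ANOVA calculation. The two lists are hypothetical tap counts, not the report's data. The critical value 5.99 is the tabulated 5% value for the toy example's degrees of freedom (1 and 6), so it differs from the 3.86 in the Excel output.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical train tap counts, made up for illustration.
tap_on = [10, 12, 14, 16]
tap_off = [11, 13, 15, 19]

f_stat, df_between, df_within = one_way_anova_f(tap_on, tap_off)
F_CRIT = 5.99  # 5% critical value for df = (1, 6), taken from an F table
reject_null = f_stat > F_CRIT
print(f"F = {f_stat:.2f}, df = ({df_between}, {df_within}), reject H0: {reject_null}")
```

With only two groups, this F test is equivalent to a two-sided two-sample t-test with pooled variance.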

# Section 3c: Conclusion for section

We saw that, of the three stations selected for the purpose (Bankstown, Gosford and Parramatta), the heaviest footfall is at Parramatta. Central Station is busier still. We also saw that the mean counts of tap on and tap off do not differ significantly at the 0.05 significance level.

Hence it is recommended to build an underground railway line from Parramatta Station to Central Station so as to ease the heavy footfall.

# Section 5: Discussion & Conclusion

In the above report, it was seen that during the period from 8th August to 14th August 2016, the types of public transport used in NSW were bus, train, light rail and ferry. Only a small percentage of passengers used ferry (2%) and light rail (1%). The majority of people relied on buses (37%) and trains (60%).

It was seen that the top 25 train stations accounted for almost 69% of the total train users, whereas the top 25 bus locations accounted for almost 53% of the total bus users during the period. The data is presented in the graphs below (count on the x-axis):

This indicates that traffic can be managed much better if the focus is on the top 20 or top 25 locations instead of all of them.
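The "top 25" share is a simple concentration measure: rank the locations by count and divide the sum of the largest k by the total. A minimal Python sketch follows. The counts and the function name `top_k_share` are hypothetical and serve only to show the calculation.

```python
def top_k_share(counts_by_location, k):
    """Share of the total count contributed by the k busiest locations."""
    ranked = sorted(counts_by_location.values(), reverse=True)
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical counts per location, made up for illustration.
station_counts = {"A": 50, "B": 30, "C": 10, "D": 6, "E": 4}

share = top_k_share(station_counts, 2)
print(f"Top 2 locations account for {share:.0%} of the count")
```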

It was found that there is a significant difference between the mean counts for bus and train, indicating that train usage is much higher than bus usage and may account for more than 50% of transport users.

Hence, the focus is on improving traffic management at train stations. Central Station (6.57%) is the second busiest station, after Town Hall Station (10.09% of train users), as per the provided data. Analysis was done to understand the need for an underground railway line from Central Station to one of the three selected stations (corresponding percentage of train users in parentheses): Parramatta (2.59%), Bankstown (0.80%) and Gosford (0.86%).

Looking at the data, it can be seen that Parramatta is the busiest of the three, with the highest footfall. Hence, an underground line will be most beneficial from Parramatta to Central Station.

The data is also categorised by tap on and tap off, but hypothesis testing showed no significant difference between the mean counts of tap on and tap off. Hence, it seems that this categorisation can be ignored for the purpose of analysis without any significant impact on the findings.