Data Analysis On Health And Population Statistics


Question :

ICT110 Introduction to Data Science

Background

A research team planned to study the development of Australia since 1995. The team retrieved the dataset from World Bank (http://databank.worldbank.org) about Development Indicators between 1995 and 2016.

The dataset covers various interesting aspects about the country, such as GDP, Trade, Population, Tax, Income, Health, etc.

The details about the data attributes and data content can be found in the attached documents.

You are a member of the team, and need to perform data analysis on selected attributes. The team has not set any specific goal for the analysis. Therefore, you have the freedom to explore the data, and dig out anything you feel interesting or significant.

You have been requested to prepare a data analysis report about your work and explain your findings. The potential audiences include other researchers, business representatives, and government agencies. They may have limited ICT or mathematical knowledge.

To prepare the report, please include the following sections:

1. Introduction

Provide an introduction to the problem. Include background material as appropriate: who cares about this problem, what impact it has, where does the data come from, what are the dimensions and structures of the data.

2. Data Setup

Describe how to load the data, and how the pre-processing is performed.

The data set is directly downloaded from World Bank, and may not be in the form suitable for analysis. At least, it is different from the data organisations that we are familiar with in previous practices. This means we need to do some pre-processing, either for the whole dataset, or for a subset of the dataset required for each sub task described later.

Once you have some ideas of exploratory or advanced analysis, you need to adjust the form of data set. This can be achieved either by manipulating records in R by transposition or subsetting, or with other tools (e.g. notepad or excel) before reading into R. Please explain your solution in this section.

3. Exploratory Data Analysis

Perform 3 one-variable analyses. Plot at least one graph for each variable. Explain why the selected graph is appropriate.

Perform 2 two-variable analyses. Plot at least one graph for each variable. Explain why the selected graph is appropriate. The analysis can be performed on all years, or on a subset of your interest.


4.1 Clustering

Briefly explain the concept of clustering and k-means. Try to do a clustering analysis to group years according to a selected attribute.

4.2 Linear Regression

Briefly explain the concept of linear regression. Try to do 2 linear regression analysis. Plot the learned models. The analysis can be performed on all years, or on a subset of your interest.

5. Conclusion

6. Reflections

In this part, discuss any difficulties you had performing the analysis and how you solved those difficulties. Reflect on how the analysis process went for you, what you learnt, and what you might do differently next time.

For the data analysis, you need to provide both R code, and the explanation to the code and the result. For the section 2 – 4, please represent each R code snippet in a box with some comments. For example:

# Draw a boxplot on the attribute “

1.0 Introduction

This report is limited to the research supplied in BUS110 “Health and Population Statistics between 2001 and 2015” and the skills supplied in the course Introduction to Data Science at the University of the Sunshine Coast (Nash 2013).

This report looks at data analysis using the R console. The data being analysed have been supplied by the BUS110 course and contain information about Health and Population Statistics between 2001 and 2015. Selected data have been extracted using the R console and plotted in appropriate graphs. Clustering and linear regression within the R console have also been used to compare selected countries.

To complete this report, secondary data from reputable academic sources were researched, and an Excel CSV file containing data on Health and Population Statistics between 2001 and 2015 was supplied (Nash 2013). R console skills were taught throughout the BUS110 course and used to complete this assessment.

2.0 Data Setup

The first step in data setup is downloading the data from Blackboard into an Excel document. The data are then looked through and checked for correct formatting, e.g. that all columns and rows are in their right places. The data are then converted into CSV (comma-separated values) format so that R can easily separate each data point. Once the document has been saved in CSV format, it is placed in the network drive folder so that R can easily locate it. The next step is loading the CSV data into R, which is done using the commands below.

This is the first step in R; it will only work if all the steps beforehand have been completed.

> #Reading Data into R console and placing it under another name

> data <- read.csv("Health and Population Statistics_Data.csv")

The Health and Population Statistics (HPS) are read into R and saved to data. This will make it easier to call on data later on.

> # Displaying the data in R console

> data

To check that the HPS CSV file has been correctly loaded and saved to data, we print data; this is done by just entering data into R. It should print out all of the data in the same format as in the CSV file. Once this step is completed, you can alter the data to get the information you want from it, as in the steps below.

> #Dropping Series Name

> data$Series.Name <- NULL

> #Dropping Country Name

> data$Country.Name <- NULL

This is done because these columns duplicate information held elsewhere in the data, just in a different format.

This means all data has now been loaded onto R allowing sections 3 and 4 to be possible.
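The later sections type the plotted values into R by hand. As an alternative, the subsetting and transposition step could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (Series.Name, Country.Code, X2001 and so on), the shortened series labels, and the three-year toy table are hypothetical stand-ins, not the layout of the real file, although the values themselves are the first three male and female life expectancy figures used in Section 3.1.

```r
# Toy stand-in for the downloaded CSV.
# ASSUMPTION: column names and series labels are hypothetical, not from the real file.
csv_text <- "Series.Name,Country.Code,X2001,X2002,X2003
Life expectancy male,AUS,77,77.4,77.8
Life expectancy female,AUS,82.4,82.6,82.8"
wide <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Subset the single row for one series and one country
row <- wide[wide$Series.Name == "Life expectancy male" &
            wide$Country.Code == "AUS", ]

# Keep only the year columns (names starting with "X" followed by four digits)
year_cols <- grep("^X[0-9]{4}", names(wide), value = TRUE)

# Turn the one-row slice into a plain numeric vector, named by year
male <- as.numeric(unlist(row[, year_cols]))
names(male) <- sub("^X", "", year_cols)

# Print the vector ready for plotting
male
```

A vector built this way can be passed straight to plot() or barplot(), which avoids copying numbers across by hand.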

3.0 Exploratory Data Analysis

3.1 Single Variable Analysis

3.1.1 Graph 1

>male <- c(77, 77.4, 77.8, 78.1, 78.5, 78.7, 79, 79.2, 79.3, 79.5, 79.7, 79.9, 80.1, 80.3)

#Range of graph

> g_range <- range(70, male)

#Plot male data in blue

>plot(male, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)

#Numbers displayed on Y axis

> axis(2, las=1, at=1*0:g_range[2])

#Years displayed on X axis

> axis(1, at=1:14, lab=c("2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014"))

#Putting box around graph

> box()

#Main title

> title(main="Male Life Expectancy in Australia, (years)", col.main="black", font.main=4)

#Title of x axis

> title(xlab="Years")

#Title of y axis

> title(ylab="Age")

3.1.2 Graph 2

>female <- c(82.4, 82.6, 82.8, 83, 83.3, 83.5, 83.7, 83.7, 83.9, 84, 84.2, 84.3, 84.3, 84.3)

#Range of graph

> g_range <- range(70, female)

#Plot female data in red

>plot(female, type="o", col="red", ylim=g_range, axes=FALSE, ann=FALSE)

#Numbers displayed on y axis

> axis(2, las=1, at=1*0:g_range[2])

#Number displayed on x axis

> axis(1, at=1:14, lab=c("2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014"))

#Box around graph

> box()

#Main title

> title(main="Female Life Expectancy in Australia, (years)", col.main="black", font.main=4)

#X axis title

> title(xlab="Years")

#Y axis title

> title(ylab="Age")

3.1.3 Graph 3

#Data for fertility rate in women for all countries in 2014, loaded to (firt)

>firt <- c(1.859, 1.874, 2.635, 1.562, 2.564, 2.043, 2.386, 1.234, 2.463, 1.42, 3.73, 1.979, 1.205, 2.991, 1.243, 1.944, 3.243, 2.655, 2.204, 2.24, 1.92, 3.757, 2.977, 4.086, 1.25, 3.966, 1.512, 5.1, 3.722, 3.347, 1.961)

#Code to create Barplot in R using (firt) for Data

> barplot(firt, main="Fertility rate for 2014", xlab="Country", ylab="Births per woman", names.arg=c("AUS", "BRN", "KHM", "CHN", "FJI", "PYF", "GUM", "HKG", "IDN", "JPN", "KIR", "PRK", "KOR", "LAO", "MAC", "MYS", "FSM", "MNG", "MMR", "NCL", "NZL", "PNG", "PHL", "WSM", "SGP", "SLB", "THA", "TLS", "TON", "VUT", "VNM"), border="red")

3.2 Dual Variable Analysis

3.2.1 Graph 1

#Range of graph

>g_range <- range(77, male, female)

#Plot female data in red

> plot(female, type="o", col="red", ylim=g_range, axes=FALSE, ann=FALSE)

#Years displayed on X axis

> axis(1, at=1:14, lab=c("2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014"))

#Numbers displayed on Y axis

> axis(2, las=1, at=1*0:g_range[2])

#Putting Box around Graph

> box()

#Plot male data in blue

> lines(male, type="o", pch=22, lty=2, col="blue")

#Main title

> title(main="Female and Male Life Expectancy in Australia, (years)", col.main="black", font.main=4)

#X axis Title

> title(xlab="Years")

#Y axis Title

> title(ylab="Age")

#Creation of a legend

> legend(1, g_range[2], c("Female","Male"), cex=0.8, col=c("red","blue"), pch=21:22, lty=1:2);

3.2.2 Graph 2

#Range of graph

> g_range <- range(0, firt, firt01)

#Plot Fertility of 2014 in all countries

> plot(firt, type="o", col="red", ylim=g_range, axes=FALSE, ann=FALSE)

#Countries on X axis

> axis(1, at=1:31, lab=c("AUS", "BRN", "KHM", "CHN", "FJI", "PYF", "GUM", "HKG", "IDN", "JPN", "KIR", "PRK", "KOR", "LAO", "MAC", "MYS", "FSM", "MNG", "MMR", "NCL", "NZL", "PNG", "PHL", "WSM", "SGP", "SLB", "THA", "TLS", "TON", "VUT", "VNM"))

#Numbers on Y axis

> axis(2, las=1, at=1*0:g_range[2])

#Putting a Box around Graph

> box()

#Plot fertility of 2001 in all countries

> lines(firt01, type="o", pch=22, lty=2, col="blue")

#Main title

> title(main="Fertility Rate 2001 & 2014", col.main="black", font.main=4)

#X axis title

> title(xlab="Country")

#Y axis title

> title(ylab="Births per woman")

#Creation of legend

> legend(1, g_range[2], c("2014","2001"), cex=0.8, col=c("red","blue"), pch=21:22, lty=1:2);

4.1 Clustering

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups, or clusters (Jain, Murty and Flynn 1999). Clustering can be performed in R, allowing easy and fast information retrieval. Because the groups are not labelled in advance, and k-means in particular starts from randomly chosen cluster centres, the researcher must work out what pattern each cluster reflects and why the data have been grouped in that way. Clustering in R can be done with multiple variables and multiple clusters.
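As a minimal sketch of what grouping years by one attribute could look like, the male life expectancy values from Section 3.1.1 can be passed to R's built-in kmeans() function. The choice of two clusters, the seed value, and the variable names years and km are assumptions made purely for illustration; this is not the analysis the report itself carried out.

```r
# Male life expectancy in Australia, 2001-2014 (values as listed in Section 3.1.1)
male <- c(77, 77.4, 77.8, 78.1, 78.5, 78.7, 79, 79.2, 79.3, 79.5, 79.7, 79.9, 80.1, 80.3)
years <- 2001:2014

# k-means starts from random centres, so fix the seed for a repeatable result
set.seed(1)

# Group the 14 years into k = 2 clusters (k = 2 is an arbitrary, illustrative choice)
km <- kmeans(male, centers = 2, nstart = 10)

# Show which cluster each year was assigned to
print(data.frame(year = years, male = male, cluster = km$cluster))

# Cluster centres: the mean life expectancy within each group
print(km$centers)
```

With a single attribute, each year is simply assigned to whichever centre its value is closest to, so the interpretation step is mainly deciding whether the resulting split of years is meaningful.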

4.2 Linear Regression

Regression analysis examines the relationship that may exist between two variables (Seber and Lee, 2003). Linear regression models the relationship between a Y variable and an X variable as a straight line, with Y being the scalar dependent variable and X the explanatory variable. Two examples of linear regression within R are supplied below.

#This graph depicts the relationship between average male and female life expectancy from 2001 to 2014.

> fit <- lm(female ~ male)

> fit

Call:

lm(formula = female ~ male)

Coefficients:

(Intercept)         male

33.1219       0.6395

> sum(fit$residuals^2)

[1] 0.07980418

> plot(female ~ male)

> abline(fit, col="red")

# From this model we can predict the female life expectancy that corresponds to a male life expectancy of 80 years in Australia.

> predict(fit, data.frame(male=80))

1

84.27941

#This green line was not displayed in the current chart. It just marks where the red regression line is crossed at a male life expectancy of 80.

> abline(v=80, col="green")
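The prediction for a male value of 80 can also be reproduced by hand from the fitted coefficients: using the rounded values printed above, 33.1219 + 0.6395 × 80 ≈ 84.28, which agrees with the predict() result up to rounding of the coefficients. The sketch below checks this in R and also relates the residual sum of squares to R-squared. The names by_hand, from_predict, rss, tss and r_squared are introduced here for illustration only.

```r
# Life expectancy values as listed in Sections 3.1.1 and 3.1.2
male <- c(77, 77.4, 77.8, 78.1, 78.5, 78.7, 79, 79.2, 79.3, 79.5, 79.7, 79.9, 80.1, 80.3)
female <- c(82.4, 82.6, 82.8, 83, 83.3, 83.5, 83.7, 83.7, 83.9, 84, 84.2, 84.3, 84.3, 84.3)

# Fit the straight line: female = intercept + slope * male
fit <- lm(female ~ male)
b <- coef(fit)

# Prediction by hand from the coefficients, and the same via predict()
by_hand <- unname(b["(Intercept)"] + b["male"] * 80)
from_predict <- unname(predict(fit, data.frame(male = 80)))
print(c(by_hand = by_hand, from_predict = from_predict))

# Residual sum of squares (what is left unexplained by the line)
rss <- sum(residuals(fit)^2)

# Total sum of squares (variation of female around its mean)
tss <- sum((female - mean(female))^2)

# R-squared: the share of variation explained by the fitted line
r_squared <- 1 - rss / tss
print(r_squared)
```

Reporting R-squared alongside the residual sum of squares makes the fit easier to judge for readers with limited mathematical background, since it is a proportion between 0 and 1 rather than a quantity in squared units.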

# This graph depicts the relationship between the fertility rates of all countries in 2001 and 2014.

> fit <- lm(firt2001 ~ firt2014)

> fit

Call:

lm(formula = firt2001 ~ firt2014)

Coefficients:

(Intercept)         firt2014

-0.5457         1.3690

> sum(fit$residuals^2)

[1] 4.175939

> plot(firt2001 ~ firt2014)

> abline(fit, col="red")

# From this model we can see the 2001 fertility rate that corresponds to a fertility rate of 2 children in 2014. Because the line is fitted across all countries, the result will be very unspecific and probably incorrect for most individual countries.

> predict(fit, data.frame(firt2014=2))

1

2.192217

#This green line was not displayed in the current chart. It marks where the red regression line is crossed at 2 children per woman in 2014. We can use this to estimate what a rate of 2 children in 2014 corresponds to in 2001.

> abline(v=2, col="green")

6.0 Conclusion

This report detailed the use of the R console in relation to Health and Population Statistics within East Asia & Pacific. Data setup from Excel to the R console and formatting within R were performed, and the R console was used to create single- and dual-variable graphs from specific Health and Population Statistics data. Clustering and linear regression were also attempted, and a brief explanation was provided for both. It was found that life expectancy in Australia increased for both the female and male populations from 2001 to 2014. The difference in fertility rates between all supplied countries in 2014 was depicted in a bar graph. Two-variable graphs were used to show the difference between female and male life expectancy from 2001 to 2014, and the difference in fertility rates for all countries in East Asia & Pacific between 2001 and 2014. Linear regression was used to calculate the intercept and slope for female against male life expectancy between 2001 and 2014, and for fertility rates for all countries between 2001 and 2014. In conclusion, the data tell us that Australians are living longer than ever before and that fertility rates are decreasing for the majority of East Asia & Pacific countries. With more time spent analysing the data, a great deal more information could be extracted.

7.0 Reflections

In reflection, I feel I may have wasted time at the beginning of my assessment purely on R coding and trying to get the best format for my data. I struggled massively with getting the original data to do what I wanted in graphs, and in the end I just copied the data into R; it was a lot easier and I was able to get on with the actual marked work. If I were able to do it again, I would use more information, data and time to create more interesting graphs. I found the data very interesting and I would love to be able to spend more time on it. I can understand the usefulness of using R to find key and interesting information in vast data files. I feel that with more time, practice and learning my abilities in R will increase and I will be able to use R to its full extent.
