Examination And Analysis Of Data From Australian Cities


Question :

Project Autumn 2018

301114 Nature of Data

Project Questions

The questions in this project examine the data in the file project2018S.csv. Download the file from the vUWS site. Each row of the file is an observation relating to one person from Australia. The file contains the variables:

• City: The city in which the person lives. 

• Movie: A movie that the person most recently watched. 

• Age: The age of the person. 

• Rating: The rating (out of 10) that the person gave the movie.

Using this data, perform the following analysis.

1. Representative Sample

We should first check that the data is representative of the population. Our sample is from the six Australian cities with population counts shown below

City        Population   Proportion
Sydney      5131326      0.321
Melbourne   4850740      0.304
Brisbane    2408223      0.151
Adelaide    1313927      0.082
Perth       2050138      0.128
Hobart      226884       0.014

Test if the distribution of cities in the data matches the above population proportions of the cities.

2. Chosen Movie

It is believed that people from different cities have different preferences. Is there a dependence between the sample variables City and Chosen Movie?

3. Harsh Raters

As people grow older, it is believed that they expect more from movies and so provide lower ratings. To answer this question, test if there is a difference in the mean rating for people over and including 40 years old, compared to the mean rating from people under 40 years old.

4. Best Movie

Is there evidence that the movies are not equally preferred? Test if the mean ratings are equal for all movies. If there is not equal preference, which is preferred over which?

5. Age and Ratings

We found that older people rate movies lower. Using a linear model of Rating as a function of Age, is there evidence that the gradient is not zero? Use the model to predict the expected rating for someone of age 32.


Answer :

Nature of Data

Answer 1

Code

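The R script for this part is as follows:

# Population counts for the six cities, in the order listed in the question
counts <- matrix(c(5131326, 4850740, 2408223, 1313927, 2050138, 226884))

counts

# Proportion of the total population in each city
prop.table(counts, 2)

# Bar plot of the proportions
barplot(prop.table(counts, 2))

R script of Representative Sample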

Graph

The barplot is as follows:

Barplot of Representative Sample

Observation

The data for this part is a table with the columns City, Population and Proportion. The script calculates each city's proportion from its population count.

Analysis

We can see that the calculated proportions agree with the proportions given in the table. Sydney and Melbourne have the largest shares of the population, followed by Brisbane, Perth, Adelaide and Hobart.

Conclusion

Hence, we can say that the proportions calculated from the population counts match those given in the question.
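
The question asks for a test of whether the distribution of cities in the sample matches the population proportions. One way to sketch this is a chi-square goodness-of-fit test on the observed city counts. The helper name city_gof_test and the sample vector toy_cities below are hypothetical, made up only so that the sketch runs on its own; with the real data the input would be the City column of project2018S.csv.

```r
# Population counts from the table in the question
population <- c(Sydney = 5131326, Melbourne = 4850740, Brisbane = 2408223,
                Adelaide = 1313927, Perth = 2050138, Hobart = 226884)

# Hypothetical helper: goodness-of-fit test of observed city counts
# against the population proportions
city_gof_test <- function(cities, pop = population) {
  # Count each city in the same order as `pop` (cities with no rows count as 0)
  observed <- table(factor(cities, levels = names(pop)))
  chisq.test(observed, p = pop / sum(pop))
}

# Made-up sample of 1000 people, for illustration only (NOT the project data)
toy_cities <- rep(names(population), times = c(321, 304, 151, 82, 128, 14))
gof <- city_gof_test(toy_cities)
print(gof$statistic)
print(gof$p.value)

# With the project file, the call would be:
# city_gof_test(read.csv("project2018S.csv")$City)
```

A small p-value would suggest that the city mix in the sample differs from the population. Hobart's share is only 0.014, so its expected count can be small in a modest sample; in that case the simulate.p.value = TRUE option of chisq.test is one way to avoid relying on the chi-square approximation.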

Answer 2

Code

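The R script is as follows:

file <- "project2018S.csv"

# stringsAsFactors = TRUE so that City and Movie are read as factors
data <- read.csv(file, header = TRUE, sep = ",", stringsAsFactors = TRUE)

# Chi-square test of independence between City and Movie
res <- chisq.test(data$City, data$Movie)

res

# Plot of Movie against City
plot(data$City, data$Movie)

R script of sample variables City and Chosen Movie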

Graph

The graph screenshot:

Graph Screenshot of sample variables City and Chosen Movie 

Observation

To answer this question, we first read the data from the CSV file using read.csv(). We then need to check whether the variables City and Movie are dependent. Both variables are categorical, so we use the chi-square test of independence.

Analysis

In the output above, the chi-square statistic is 10.375 and the p-value is 0.4082. Because the p-value is well above the conventional 5% significance level, there is no evidence that the two variables are dependent. Further, the graph shows that ‘The Cat with Two Tales’ is a clear favourite irrespective of city. It is followed by ‘Washing Dishes 3’, which is in turn followed by ‘Undergoal’. The pattern holds across all cities.

Conclusion

We can conclude that there is no evidence that movie preference depends on the city.
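
The chi-square test of independence works on the table of counts for each City and Movie pair and compares them with the counts expected under independence. A minimal sketch of that step follows. The helper name independence_test and the data frame toy are hypothetical, with made-up rows and placeholder movie names; they are not the project data.

```r
# Hypothetical helper: chi-square test of independence between two
# categorical columns of a data frame
independence_test <- function(df, row_var = "City", col_var = "Movie") {
  tab <- table(df[[row_var]], df[[col_var]])  # contingency table of counts
  test <- chisq.test(tab)
  list(table = tab, expected = test$expected, test = test)
}

# Made-up data, for illustration only (NOT the project data):
# three cities with 40 people each, every city choosing 25 x A and 15 x B
toy <- data.frame(
  City  = rep(c("Sydney", "Melbourne", "Perth"), each = 40),
  Movie = rep(rep(c("Movie A", "Movie B"), times = c(25, 15)), times = 3)
)

out <- independence_test(toy)
print(out$table)         # observed counts
print(out$expected)      # counts expected if City and Movie were independent
print(out$test$p.value)
```

Checking the expected counts is worthwhile because the chi-square approximation is usually considered unreliable when some of them are small.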

Answer 3

Code

The R script is as follows:

file <- "project2018S.csv"

data <- read.csv(file, header = TRUE, sep = ",")

# Mean rating given by people aged 40 and over
dataF <- data[data$Age >= 40, ]

mean(dataF$Rating)

barplot(dataF$Rating)

# Mean rating given by people under 40
dataF1 <- data[data$Age < 40, ]

mean(dataF1$Rating)

barplot(dataF1$Rating)

R script of mean rating for people aged 40 and over compared with people under 40

Graph

The plot is as follows:

Graph of mean rating for people over and including 40 years old to people under 40 years old


Observation

We have to compare the mean rating of people aged 40 and over with the mean rating of people under 40.

Analysis

In the output above, we can see that there is a difference between the mean rating of people under 40 and the mean rating of people aged 40 and over.

Conclusion

Hence, we can conclude that those aged 40 and over give lower ratings than those under 40.
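
Comparing two sample means by eye does not show whether the difference could be due to chance. A two-sample t-test is one way to test it, sketched below. The helper name harsh_rater_test and the data frame toy are hypothetical, and the ages and ratings are made up for illustration; the choice of a one-sided Welch test is also an assumption, based on the question's claim that older people rate lower.

```r
# Hypothetical helper: one-sided Welch t-test of whether people aged
# `cutoff` and over give lower ratings on average than younger people
harsh_rater_test <- function(df, cutoff = 40) {
  older   <- df$Rating[df$Age >= cutoff]
  younger <- df$Rating[df$Age <  cutoff]
  # alternative = "less" tests mean(older) - mean(younger) < 0
  t.test(older, younger, alternative = "less")
}

# Made-up data, for illustration only (NOT the project data)
toy <- data.frame(
  Age    = c(22, 25, 31, 35, 38, 39, 40, 44, 51, 56, 60, 63),
  Rating = c( 7,  8,  6,  9,  7,  8,  5,  6,  4,  5,  6,  5)
)

res <- harsh_rater_test(toy)
print(res$estimate)  # the two group means
print(res$p.value)

# With the project file, the call would be:
# harsh_rater_test(read.csv("project2018S.csv"))
```

By default t.test does not assume equal variances in the two groups (Welch's test); var.equal = TRUE would give the pooled-variance version instead.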

Answer 4

Code

The R script is as follows:

file <- "project2018S.csv"

data <- read.csv(file, header = TRUE, sep = ",")

# Mean rating for each movie
aggregate(data$Rating, by = list(Movies = data$Movie), FUN = mean)

R script of mean ratings for each movie

Graph


Graph of mean ratings for each movie

Observation

In the above figure we can see that The Cat with Two Tales is the most preferred movie, because its mean rating is the highest of the three movies. Undergoal is the second most preferred movie. So, we can say that The Cat with Two Tales is preferred over the other two, while Undergoal is preferred over Washing Dishes 3.

Analysis

To compare the movies, we calculated the mean rating of each movie using the aggregate() function.

Conclusion

Yes, there is evidence that the movies are not equally preferred: the mean ratings calculated for each movie differ.
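
The question asks for a test that all mean ratings are equal and, if they are not, which movie is preferred over which. One common approach is a one-way ANOVA followed by Tukey's HSD pairwise comparisons, sketched below. The helper name movie_anova and the data frame toy are hypothetical; the movie names are placeholders and the ratings are made up for illustration.

```r
# Hypothetical helper: one-way ANOVA of Rating by Movie, followed by
# Tukey's HSD comparisons of every pair of movies
movie_anova <- function(df) {
  df$Movie <- factor(df$Movie)           # TukeyHSD needs a factor
  fit <- aov(Rating ~ Movie, data = df)
  list(anova = summary(fit), tukey = TukeyHSD(fit))
}

# Made-up data, for illustration only (NOT the project data)
toy <- data.frame(
  Movie  = rep(c("Movie A", "Movie B", "Movie C"), each = 5),
  Rating = c(8, 9, 7, 8, 9,  6, 7, 6, 5, 6,  4, 5, 3, 4, 5)
)

out <- movie_anova(toy)
print(out$anova)  # F test of "all mean ratings are equal"
print(out$tukey)  # difference, interval and adjusted p-value for each pair

# With the project file, the call would be:
# movie_anova(read.csv("project2018S.csv"))
```

In the Tukey output, a pair whose interval excludes zero has mean ratings that differ, and the sign of the difference shows which movie of the pair is rated higher.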

Answer 5

Code

The R script for the linear model is:

file <- "project2018S.csv"

data1 <- read.csv(file, header = TRUE, sep = ",")

# Fit Rating as a linear function of Age
model <- lm(Rating ~ Age, data = data1)

model

summary(model)

# Predict the expected rating for someone of age 32
new.data1 <- data.frame(Age = c(32))

predict(model, newdata = new.data1, interval = "confidence")

# 1. Prediction intervals for the observed data
pred.int <- predict(model, interval = "prediction")

mydata <- cbind(data1, pred.int)

# 2. Regression line + confidence intervals
library("ggplot2")

p <- ggplot(mydata, aes(Age, Rating)) +
  geom_point() +
  stat_smooth(method = lm)

p + geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed")

R script for model to predict the expected rating for someone of age 32.

Graph

Graph of model to predict the expected rating for someone of age 32.


Observation

Here we have to predict a value using a linear regression model, with Rating modelled as a function of Age.

Analysis

We fitted the linear model using the lm() function. We then used the fitted model to predict the rating when the age is 32.

Conclusion

There is clear evidence that there is a relationship between age and rating. Further, the gradient is not zero, and the overall rating tends to go down as age increases.
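
The evidence about the gradient comes from the Age row of the coefficient table in summary(model), which holds the estimated slope and the p-value for the hypothesis that the slope is zero. A minimal sketch of pulling out those values together with the prediction at age 32 follows. The helper name age_rating_model and the data frame toy are hypothetical, and the ages and ratings are made up for illustration.

```r
# Hypothetical helper: fit Rating ~ Age, then report the slope, its
# p-value and the expected rating at a given age
age_rating_model <- function(df, age = 32) {
  fit <- lm(Rating ~ Age, data = df)
  coefs <- summary(fit)$coefficients
  list(
    slope         = coefs["Age", "Estimate"],
    slope_p_value = coefs["Age", "Pr(>|t|)"],  # test of slope = 0
    prediction    = predict(fit, newdata = data.frame(Age = age),
                            interval = "confidence")
  )
}

# Made-up data, for illustration only (NOT the project data)
toy <- data.frame(
  Age    = c(20, 25, 30, 35, 40, 45, 50, 55),
  Rating = c( 8,  8,  7,  7,  6,  6,  5,  5)
)

out <- age_rating_model(toy)
print(out$slope)
print(out$slope_p_value)
print(out$prediction)  # columns: fit, lwr, upr

# With the project file, the call would be:
# age_rating_model(read.csv("project2018S.csv"), age = 32)
```

The interval = "confidence" option gives an interval for the expected (mean) rating at that age, which matches the wording of the question; interval = "prediction" would instead give a wider interval for one individual's rating.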
