301114 Nature of Data
Project Description
In this assignment there are 5 parts. For each part you should:
• Draw an appropriate plot,
• Provide an initial estimate of the solution given the plot (describe what the plot tells us about the problem),
• Conduct and report the analysis, and
• Describe the conclusions in words.
Project Questions
• City: The city in which the person lives.
• Movie: A movie that the person most recently watched.
• Age: The age of the person.
• Rating: The rating (out of 10) that the person gave the movie.
Using this data, perform the following analysis.
Answer 1
Code
The R script and bar chart are as follows:
# Count for each city, in the same order as the city vector below
data_count<-matrix(c(5131326,4850740,2408223,1313927,2050138,226884))
data_count
# Convert the counts to proportions of the total
proportion <-prop.table(data_count,2)
prop<-c(proportion)
prop
city<-c("Sydney","Melbourne","Brisbane","Adelaide","Perth","Hobart")
# Bar chart of the proportion for each city
barplot(prop,names.arg=city,xlab="City",ylab="Proportion",col="blue",
main="Proportion chart",border="red")
Result in R
Graph
Observations
In the graph above, the x-axis represents the cities while the y-axis represents the proportion of the population picked from each city.
Analysis & Results
From the graph above, it can be seen that the largest proportions are from Sydney and Melbourne, followed by Brisbane and Perth, and then Adelaide and Hobart.
Conclusion
This order is in line with the actual populations of the cities. Hence, we can say that the sample matches the population distribution.
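The comparison above is visual only. A minimal sketch of how it could be checked formally is a chi-squared goodness-of-fit test, taking the counts in the script above to be the city populations. The sample counts below are invented purely for illustration and are not results from the survey; in practice they would come from something like table(data$City), reordered to match the population vector.

```r
# City counts from the script above, assumed here to be city populations
population <- c(Sydney = 5131326, Melbourne = 4850740, Brisbane = 2408223,
                Adelaide = 1313927, Perth = 2050138, Hobart = 226884)
pop_prop <- population / sum(population)

# HYPOTHETICAL sample counts, made up for this sketch only
sample_count <- c(Sydney = 320, Melbourne = 300, Brisbane = 150,
                  Adelaide = 85, Perth = 130, Hobart = 15)

# Goodness-of-fit test: H0 is that the sample follows pop_prop
gof <- chisq.test(sample_count, p = pop_prop)
gof$expected  # counts expected under H0
gof$p.value   # a small p-value would suggest the sample does not match
```

A large p-value would be consistent with the sample matching the population proportions; it would not prove that they are identical.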
Answer 2
To test for dependency between the movie and city variables, we used the chi-square test as follows:
Code
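file = "project2018S.csv"
# Read strings as factors so that plot() can draw the stacked chart
data = read.csv(file, header = TRUE, sep = ",", stringsAsFactors = TRUE)
movie <- data$Movie
city <-data$City
# Chi-square test of independence between movie and city
chisq.test(movie,city)
# Stacked plot of city proportions within each movie
plot(movie,city)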
Result
Graph
Observations
The graph above stacks the cities against the movie names. The x-axis gives the movie names, while the y-axis gives the city names and the proportion of people from each city.
Analysis & Results
From the graph above, it can be seen that the movie ‘The Cat with Two Tales’ is the most popular in all cities, followed by ‘Washing Dishes’ and then ‘Undergoal’. Adelaide and Brisbane show a stronger preference for ‘Washing Dishes’ than the other cities.
Conclusion
We cannot say conclusively that the preference varies between cities; however, there is slight evidence of variation in the case of ‘Washing Dishes’.
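To compare movie preference between cities directly, one option is to tabulate the proportion of each movie within each city. The sketch below uses a tiny made-up data frame with placeholder movie names ("A", "B", "C") standing in for the survey file, so the numbers it prints say nothing about the real data.

```r
# HYPOTHETICAL toy data standing in for project2018S.csv
toy <- data.frame(
  City  = c("Sydney", "Sydney", "Sydney", "Perth", "Perth",
            "Perth", "Perth", "Hobart", "Hobart"),
  Movie = c("A", "A", "B", "A", "B", "B", "C", "C", "C")
)

# Contingency table of counts: rows are cities, columns are movies
counts <- table(toy$City, toy$Movie)

# Proportion of each movie within each city (each row sums to 1)
row_prop <- prop.table(counts, margin = 1)
round(row_prop, 2)
```

Rows that look very different from one another would point to city-specific preferences, which is what the chi-square test of independence assesses formally.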
Answer 3
To check whether people aged 40 and over are harsher raters than those under 40, the R script is as follows:
Code
file = "project2018S.csv"
data = read.csv(file, header = TRUE, sep = ",")
# Mean and histogram of the ratings given by people aged 40 and over
above_40 <- data[data$Age >= 40,]
data_40 <- above_40$Rating
mean(data_40)
hist(data_40)
# Mean and histogram of the ratings given by people aged under 40
below_40 <- data[data$Age < 40,]
data_less40 <- below_40$Rating
mean(data_less40)
hist(data_less40)
Result
Graph
Observations
The graphs above are histograms of the ratings from people aged 40 and over and from those under 40. The x-axis represents the rating, while the y-axis represents its frequency.
Analysis & Results
From the graphs above, it can be seen that there is a visible difference in the average rating of the two age groups. Ratings from those aged 40 and over are concentrated around 6-8, while ratings from those under 40 are concentrated around 7-8.
Conclusion
While both groups peak around a rating of 7, those aged 40 and over tend to give lower ratings than those under 40. Hence, there is a difference in the average rating between the two groups.
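The script above compares the two means by eye. A minimal sketch of a formal comparison is a Welch two-sample t-test, shown here on simulated ratings rather than the survey data; the data frame toy and its Group column are names introduced for this example only.

```r
set.seed(1)

# SIMULATED ages and ratings, not the survey data
toy <- data.frame(
  Age    = sample(18:70, 200, replace = TRUE),
  Rating = sample(1:10, 200, replace = TRUE)
)

# Label each person by age group, using the same cut-off of 40
toy$Group <- ifelse(toy$Age >= 40, "40 and over", "under 40")

# Welch two-sample t-test of H0: both groups have the same mean rating
tt <- t.test(Rating ~ Group, data = toy)
tt$estimate  # mean rating in each group
tt$p.value   # a small p-value would suggest the means differ
```

On the real data, Group would be built from data$Age in the same way. If the question is specifically whether the older group is harsher, a one-sided alternative could be used instead of the two-sided default.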
Answer 4
The movies are not equally preferred, and the mean ratings are not equal across movies. The R script for this is as follows:
Code
file = "project2018S.csv"
data = read.csv(file, header = TRUE, sep = ",")
movie <- data$Movie
rating <-data$Rating
# Mean rating for each movie (aggregate names the result column "x")
mean_rating <- aggregate(rating, by=list(Movies=movie), FUN=mean)
mean_rating
# Bar chart of the mean rating for each movie
barplot(mean_rating$x,names.arg=mean_rating$Movies,xlab="Movie",ylab="Mean rating",col="blue",
main="Movie-Rating chart",border="red")
Result
Graph
Observations
The graph above indicates XXX
Analysis & Results
There is a clear difference in preference between the movies: the majority of people prefer ‘The Cat with Two Tales’, followed by ‘Washing Dishes 3’ and then ‘Undergoal’.
Conclusion
As seen above, there is a clear difference in preference between the three movies, and the audience’s favourite appears to be ‘The Cat with Two Tales’.
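The two claims in this answer (unequal preference and unequal mean ratings) can each be tested. The sketch below is one possible approach and not necessarily the method the assignment expects: a chi-squared goodness-of-fit test against equal proportions, and a one-way ANOVA of rating on movie. It runs on simulated data with placeholder movie names, so its output says nothing about the survey.

```r
set.seed(2)

# SIMULATED data with placeholder movie names, not the survey data
toy <- data.frame(
  Movie  = sample(c("Movie A", "Movie B", "Movie C"), 150, replace = TRUE),
  Rating = sample(1:10, 150, replace = TRUE)
)

# Equal preference: goodness-of-fit test against equal proportions
pref <- chisq.test(table(toy$Movie))

# Equal mean rating: one-way ANOVA of rating on movie
fit <- aov(Rating ~ Movie, data = toy)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

pref$p.value  # small value: movies are not equally preferred
anova_p       # small value: mean ratings differ between movies
```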
Answer 5
We have used a linear regression model. As the question requires, rating is modelled as a function of age. The implementation is as follows:
Code
file = "project2018S.csv"
data1 = read.csv(file, header = TRUE, sep = ",")
# Fit rating as a linear function of age
model <- lm(Rating ~ Age, data = data1)
model
summary(model)
# Confidence interval for the mean rating at age 32
new.Age <- data.frame(Age = c(32))
predict(model, newdata=new.Age,interval = "confidence")
# Prediction intervals for the observed ages, attached to the data
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(data1, pred.int)
# Scatter plot of rating against age with the fitted regression line
plot(data1$Age,data1$Rating,col = "blue",main = "Age & Rating Regression",
cex = 1.3,pch = 16,xlab = "Age",ylab = "Rating")
abline(model)
Result
Graph
Observations
The graph above indicates XXX
Analysis & Results
Using linear regression, the intercept is 7.95 while the coefficient of Age (x) is -0.02.
Conclusion
The estimated intercept is 7.95, which is not 0. Further, the coefficient of Age is -0.02, indicating that the predicted rating falls by about 0.02 for each additional year of age.
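As a worked example of what these coefficients imply, the fitted line can be evaluated by hand at age 32, the age used in the script. Because the reported coefficients are rounded, the result is only approximate and may differ slightly from what predict() returns on the full model.

```r
intercept <- 7.95   # rounded intercept reported above
slope     <- -0.02  # rounded coefficient of Age reported above
age       <- 32     # age used in the prediction step of the script

# Fitted line: rating = intercept + slope * age
predicted <- intercept + slope * age
predicted  # 7.95 - 0.64 = 7.31
```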