# Nature of Data: R Scripts and Bar Charts


## Question

301114 Nature of Data

Project Description

In this assignment there are 5 parts. For each part you should:

• Draw an appropriate plot,

• Provide an initial estimate of the solution given the plot (describe what the plot tells us about the problem),

• Conduct and report the analysis, and

• Describe the conclusions in words.

Project Questions

• City: The city in which the person lives.

• Movie: A movie that the person most recently watched.

• Age: The age of the person.

• Rating: The rating (out of 10) that the person gave the movie.

Using this data, perform the following analysis.

Code

The R script for the bar chart is as follows:

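data_count <- matrix(c(5131326, 4850740, 2408223, 1313927, 2050138, 226884))

data_count

city <- c("Sydney", "Melbourne", "Brisbane", "Adelaide", "Perth", "Hobart")  # labels for the bars; this order is an assumption

proportion <- prop.table(data_count, 2)  # column-wise proportions (the matrix has one column)

prop <- c(proportion)

prop

barplot(prop, names.arg = city, xlab = "City", ylab = "Proportion", col = "blue", main = "Proportion chart", border = "red")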

Result in R

Graph

Observations

In the graph above, the x-axis shows the cities and the y-axis shows the proportion of the population drawn from each city.

Analysis & Results

From the graph above, it can be seen that the largest proportions come from Sydney and Melbourne, followed by Brisbane and Perth, and then Adelaide and Hobart.

Conclusion

This order is in line with that of the actual populations of the cities. Hence, we can say that the distribution of the sample across cities matches the population distribution.
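A formal check of this match could use a chi-square goodness-of-fit test. The sketch below is illustrative only: `sample_count` holds invented counts (the real ones would come from something like `table(data$City)`), and the order of the city names attached to the population figures is an assumption.

```r
# Population figures from the script above; the city order is an assumption.
population <- c(Sydney = 5131326, Melbourne = 4850740, Brisbane = 2408223,
                Adelaide = 1313927, Perth = 2050138, Hobart = 226884)
expected_prop <- population / sum(population)

# Hypothetical sample counts per city (illustration only, not the project data).
sample_count <- c(Sydney = 330, Melbourne = 295, Brisbane = 148,
                  Adelaide = 85, Perth = 127, Hobart = 15)

# Goodness-of-fit test: H0 says the sample follows the population proportions.
gof <- chisq.test(sample_count, p = expected_prop)
gof$statistic  # chi-square statistic
gof$p.value    # a large p-value gives no evidence against a match
```

A large p-value here would support the conclusion that the sample reflects the population distribution; a small one would suggest some cities are over- or under-represented.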

To test whether the movie watched depends on the city, we used the chi-square test of independence as follows:

Code

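file = "project2018S.csv"

data <- read.csv(file, stringsAsFactors = TRUE)  # load the file; text columns become factors

movie <- data\$Movie

city <- data\$City

chisq.test(movie, city)  # chi-square test of independence

plot(movie, city)  # spine plot of city within each movie (both are factors)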

Result

Graph

Observations

The graph above stacks the cities within each movie. The x-axis gives the names of the movies, while the y-axis gives the names of the cities and the proportion of people from each city.

Analysis & Results

From the graph above, it can be seen that ‘The Cat with Two Tales’ is the most popular movie in all cities, followed by ‘Washing Dishes’ and then ‘Undergoal’. Adelaide and Brisbane show a stronger preference for ‘Washing Dishes’ than the other cities.

Conclusion

We cannot say conclusively that movie preference varies by city; however, there is slight evidence of variation in the case of ‘Washing Dishes’.
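To ground such a conclusion in the test output, the p-value from `chisq.test` can be compared with a significance level. Below is a minimal sketch on a hypothetical contingency table; the counts are invented for illustration and the 5% level is an assumption.

```r
# Hypothetical city-by-movie counts (illustration only, not the project data).
counts <- matrix(c(40, 25, 15,
                   38, 24, 18,
                   20, 22,  8),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(City = c("Sydney", "Melbourne", "Adelaide"),
                                 Movie = c("The Cat with Two Tales",
                                           "Washing Dishes", "Undergoal")))

# Chi-square test of independence: H0 says movie choice does not depend on city.
result <- chisq.test(counts)
result$expected  # expected counts under independence

alpha <- 0.05    # assumed significance level
if (result$p.value < alpha) {
  message("Reject H0: movie preference appears to depend on city.")
} else {
  message("Fail to reject H0: no clear evidence of dependence.")
}
```

Comparing `result$expected` with the observed counts also shows which city and movie combinations contribute most to any departure from independence.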

The R script comparing the ratings given by people aged below 40 with those given by people aged 40 and above, to see which group rates more harshly, is as follows:

Code

file = "project2018S.csv"

data <- read.csv(file)  # load the data before subsetting

above_40 <- data[data\$Age >= 40,]  # people aged 40 and above

data_40 <- above_40\$Rating

mean(data_40)

hist(data_40)

below_40 <- data[data\$Age < 40,]  # people aged below 40

data_less40 <- below_40\$Rating

mean(data_less40)

hist(data_less40)

Result

Graph

Observations

The graphs above are histograms of the ratings given by people aged 40 and above and by people aged below 40. The x-axis represents the rating, and the y-axis represents its frequency.

Analysis & Results

From the graphs above, it can be seen that there is a visible difference in the average rating of the two age groups. Ratings from those aged 40 and above are concentrated around 6-8, while ratings from those below 40 are concentrated around 7-8.

Conclusion

While both groups peak around a rating of 7, those aged 40 and above tend to give lower ratings than those below 40. Hence, there is a difference in the average rating.
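The difference in means could also be tested formally, for example with a Welch two-sample t-test. The sketch below runs on simulated data (not the project file) that only borrows the column names `Age` and `Rating`; choosing this test is an assumption, not something the assignment prescribes.

```r
set.seed(1)  # reproducible simulated data (illustration only)

# Hypothetical data with the same column names as the project file.
data <- data.frame(Age = sample(18:70, 200, replace = TRUE),
                   Rating = sample(1:10, 200, replace = TRUE))

# Split ratings by age group, as in the script above.
rating_40plus  <- data$Rating[data$Age >= 40]
rating_under40 <- data$Rating[data$Age < 40]

# Welch two-sample t-test: H0 says both groups have the same mean rating.
test <- t.test(rating_40plus, rating_under40)
test$estimate  # the two group means
test$p.value   # small values suggest the means differ
```

A one-sided alternative (`alternative = "less"`) would match the "harsher raters" question more directly, since it asks whether the older group's mean is lower.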

The movies are not equally preferred, and the mean ratings are not equal across movies. The R script for this is as follows:

Code

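file = "project2018S.csv"

data <- read.csv(file, stringsAsFactors = TRUE)  # load the data

movie <- data\$Movie

rating <- data\$Rating

table(movie)  # number of people who watched each movie

mean_rating <- aggregate(rating, by = list(Movies = movie), FUN = mean)  # mean rating per movie, stored in column x

mean_rating

barplot(mean_rating\$x, names.arg = as.character(mean_rating\$Movies), xlab = "Movie", ylab = "Rating", col = "blue", main = "Movie-Rating chart", border = "red")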

Result

Graph

Observations

The graph above shows the movies on the x-axis and their ratings on the y-axis.

Analysis & Results

There is a clear difference in the preference for the movies: most people prefer ‘The Cat with Two Tales’, followed by ‘Washing Dishes 3’ and then ‘Undergoal’.

Conclusion

As we saw above, there is clear difference in preferences of the three movies and the audience’s favourite seems to be ‘The Cat with Two Tales’.
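Both claims (unequal preference and unequal mean ratings) could be checked with standard tests. The sketch below uses simulated data, not the project file; the viewing probabilities are invented, and the choice of a chi-square test on the counts plus a one-way ANOVA on the ratings is an assumption about suitable methods.

```r
set.seed(2)  # reproducible simulated data (illustration only)

movies <- c("The Cat with Two Tales", "Washing Dishes", "Undergoal")

# Hypothetical sample: unequal viewing probabilities, ratings from 1 to 10.
movie  <- factor(sample(movies, 150, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
                 levels = movies)
rating <- sample(1:10, 150, replace = TRUE)

# Equal preference: H0 says each movie is watched by the same share of people.
pref_test <- chisq.test(table(movie))
pref_test$p.value

# Equal mean ratings: one-way ANOVA, H0 says all movies share one mean rating.
rating_fit <- aov(rating ~ movie)
summary(rating_fit)
```

With no `p` argument, `chisq.test` on a one-way table tests against equal proportions, which is exactly the "equally preferred" hypothesis.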

We used a linear regression model. As the question requires, rating is modelled as a function of age. The implementation is as follows:

Code

file = "project2018S.csv"

data <- read.csv(file)  # load the data

rating <- data\$Rating

age <- data\$Age

model <- lm(rating ~ age)

model

summary(model)

new.age <- data.frame(age = c(32))  # column name must match the predictor in the formula

predict(model, newdata = new.age, interval = "confidence")

pred.int <- predict(model, interval = "prediction")

mydata <- cbind(data, pred.int)

plot(age, rating, col = "blue", main = "Age & Rating Regression", cex = 1.3, pch = 16, xlab = "Age", ylab = "Rating")

abline(model)  # add the fitted line after the scatter plot is drawn

Result

Graph

Observations

The graph above plots rating against age, with the fitted regression line overlaid.

Analysis & Results

Using linear regression, the intercept is 7.95 while the coefficient of Age (x) is -0.02.

Conclusion

The estimated intercept is 7.95, so it is not 0. Further, the coefficient of age is -0.02, indicating that the rating decreases as age increases.
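As a worked example of what these coefficients imply, the fitted line gives a rating of 7.95 - 0.02 × 32 = 7.31 at age 32 (the age used in the script above). The sketch below reproduces that arithmetic; `predict_rating` is a hypothetical helper, and because the reported coefficients are rounded, the result only approximates what `predict()` would return from the fitted model.

```r
# Coefficients as reported in the analysis above (rounded values).
intercept <- 7.95
slope     <- -0.02

# Hypothetical helper: rating predicted by the fitted line for a given age.
predict_rating <- function(age) intercept + slope * age

predict_rating(32)                       # predicted rating for a 32-year-old
predict_rating(50) - predict_rating(40)  # change in rating over ten years of age
```

The second line shows the size of the effect: each additional ten years of age lowers the predicted rating by about 0.2 points.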
