Identifying Chosen Public Figures Twitter Image: Social Web Analytics Assessment Answer


Question :

Group Project Specification

300958 Social Web Analytics


The Group Project provides us with a chance to analyse data from Twitter as a Social Web Application, using knowledge from this unit and a computer-based statistical package. For this project, we will focus on identifying a chosen Public Figure’s Twitter image.


To complete this project:

  1. Read through this specification.
  2. Form a group and register the members in your group using the Project Groups section of vUWS.
  3. Choose a famous person who is active on Twitter, and check that the person is not already on the list of Group Project Twitter Handles. Then submit the Twitter handle of the person using the above link. It is your responsibility to ensure that your group chooses a person who is not already on the list. If two groups choose the same person, the group with the later time stamp will be asked to find a new person.
  4. Complete the data analysis required by the specification.
  5. Write up your analysis using your favourite word processing/typesetting program, making sure that all of the working is shown and presented well. Include the necessary R code along with its output in your assignment.
  6. Include the student declaration text on the front page of your report. Please make sure that the names and student numbers of each group member are clearly displayed on the front page. If a group member did not contribute to any part of the project, state their contribution as 0% (no contribution means 0 mark).
  7. Submit the report as a PDF by the due date using the Submit Group Project link.
  8. All code and its output must be shown in the project; also include comments in the code to explain what you tried to do. Put all the code in the text (not in the Appendix). Any submissions other than a PDF file will not be marked.

Project Description

A well-known public figure is investigating their public image and has approached your team to identify how the public associates with them. They want five pieces of analysis to be performed.

Analysis of Twitter language about the Public Figure

In this section, we want to examine the language used in tweets. Use the rtweet package to download tweets.

  1. Use the search_tweets function from the rtweet library to search for 1000 tweets about the person you selected. Save these tweets as “tweets”. The data in this set should be the same for all members in a group. Hence only one member should download the tweets and then save it in an RData file to be shared with your group members.
  2. Clean and pre-process the tweet text data in tweets.
  3. Display the first two tweets before and after the cleaning/processing.
  4. Use the text of the tweets as the source to construct a document-term matrix, referenced by the variable “tweets.dtm”. Apply TF-IDF weighting to obtain a weighted document-term matrix (and, where needed, a weighted term-document matrix). Store the weighted document-term matrix in the variable “tweets.wdtm”.
  5. Find how many documents were empty following the cleaning/processing.
  6. Find all unique words in your document collection. Store them in a variable "words". Find the sum of TFIDF weights of these unique words. Store them in a variable "wordsWeight". Display the first 6 and last 6 unique words in your data set.
  7. Find and display the top 100 words based on TFIDF weights.
  8. Generate a word cloud of the top 100 words.
  9. Draw a bar plot of the top 100 words.
  10. Find all terms in your weighted document term matrix with a minimum weight of 10 (or any other appropriate value).
  11. Use the cosine distance method to obtain a distance matrix between terms.
  12. Create a dendrogram of the words identified in question 8.10. Try single and complete linkage clustering.
  13. What do these words tell us about the person? Comment on what people are saying about this person.

Clustering the Users Who Posted Tweets About the Public Figure

We want to categorise (cluster) the users who posted tweets about the Public Figure, based on the descriptions provided in their Twitter accounts. The description in a user's Twitter profile gives a short piece of information about the Twitter handle. To cluster users, build a document-term matrix from the user descriptions of the tweets (note: not the tweets themselves) you downloaded in Section 8.1.

  1. Use the rtweet package’s users_data(tweets) function to extract users’ data from the tweets data object you downloaded in Section 8.1. Store the unique authors in the variable authors.
  2. Clean the data by pre-processing and then create a weighted Term Document Matrix using unique users’ descriptions.
  3. Compute the appropriate number of clusters using the elbow method. Use cosine distance.
  4. Cluster the users and visualize the clusters in a two-dimensional vector space.
  5. Display the count of users in each cluster.
  6. List a maximum of 10 screen names of users in each cluster.
  7. List the top 10 words in each cluster.
  8. Display the description of the first five users in each cluster.
  9. Comment on your findings.
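One possible shape for this section is sketched below. It assumes the "tweets" object from Section 8.1 and that the tm and proxy packages are installed; clustering on 2-D MDS coordinates is just one workable approach, and k = 3 is a placeholder to be replaced by the value read off your own elbow plot.

```r
library(rtweet)
library(tm)

# Step 1: unique authors of the downloaded tweets
users   <- users_data(tweets)
authors <- users[!duplicated(users$screen_name), ]

# Step 2: clean the descriptions and build a weighted term-document matrix
desc.corp <- VCorpus(VectorSource(authors$description))
desc.corp <- tm_map(desc.corp, content_transformer(tolower))
desc.corp <- tm_map(desc.corp, removePunctuation)
desc.corp <- tm_map(desc.corp, removeWords, stopwords("english"))
desc.wtdm <- weightTfIdf(TermDocumentMatrix(desc.corp))

# Step 3: cosine distance between users (users are the columns of the TDM),
# then 2-D coordinates via classical MDS
d  <- proxy::dist(t(as.matrix(desc.wtdm)), method = "cosine")
xy <- cmdscale(d)

# Elbow method: within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(xy, k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Within-cluster SS")

# Steps 4-5: cluster, visualize in 2-D, and count users per cluster
k  <- 3                                  # chosen from the elbow plot
cl <- kmeans(xy, k, nstart = 10)$cluster
plot(xy, col = cl, pch = 19)
table(cl)
```

Empty descriptions can produce all-zero columns, so it may be necessary to drop those users before computing the distance matrix.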

Tweet Length Analysis of the User Clusters

We want to examine if the length of tweets is dependent on the user clusters you found in Section 8.2.

Find the tweet lengths of the tweets posted by the users in each cluster, using the data downloaded in Section 8.1.

Find how many tweets are >= 100 characters in length and how many are below 100, in each cluster.

Construct a 2×M table, where M is the number of user clusters you found in Section 8.2. Each of the two rows should give the total number of tweets with length >= 100 and those below 100 in each cluster. For example, if M is 3, your data structure should be as follows:

                                       Cluster 1   Cluster 2   Cluster 3
  Count of tweets with length >=100
  Count of tweets with length <100

Is the length of tweets independent of user groups? Perform an appropriate test to answer this question.

Interpret your results in context.
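The test of independence here is a chi-square test on the 2×M table. The sketch below uses made-up counts purely to show the mechanics; in the report the table would be built from `nchar()` of the cleaned tweet texts and the cluster labels from Section 8.2.

```r
# Illustrative 2 x 3 table of tweet-length counts (placeholder numbers).
# Columns are filled cluster by cluster: c(>=100, <100) per cluster.
lenTable <- matrix(c(120,  80,    # cluster 1
                      60, 140,    # cluster 2
                      90, 110),   # cluster 3
                   nrow = 2,
                   dimnames = list(Length  = c(">=100", "<100"),
                                   Cluster = paste("Cluster", 1:3)))

# Chi-square test of independence between tweet length and cluster
test <- chisq.test(lenTable)
test$p.value   # a small p-value suggests length depends on the cluster
```

With these placeholder counts the p-value is far below 0.05, so the null hypothesis of independence would be rejected; your own data may of course lead to the opposite conclusion.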

Network of the Tweets

In this section, we want to examine the network of the tweets about the chosen person.

Document–document similarity can be computed by multiplying the document-term matrix D with its transpose: S = DDᵀ. Compute the tweet similarity matrix S from the tweets you downloaded in Section 8.1, using the TF-IDF weighting obtained from the frequencies in your document-term matrix. Note that the document similarity matrix is symmetrical about the diagonal axis. An illustration of S is shown below:

  Distance 11   Distance 12   Distance 13   Distance 14
  Distance 21   Distance 22   Distance 23   Distance 24
  Distance 31   Distance 32   Distance 33   Distance 34
  Distance 41   Distance 42   Distance 43   Distance 44

Construct a data frame that converts either the top triangle or the bottom triangle of S into an edge list, as illustrated below. For example, Distance 12 represents the distance between documents 1 and 2.

  Row Number   Column Number   Distance
  1            2               Distance 12
  1            3               Distance 13
  1            4               Distance 14
  2            3               Distance 23
  2            4               Distance 24
  3            4               Distance 34
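Converting the triangle of S into an edge list can be sketched with a toy 4-document matrix standing in for the real TF-IDF matrix (in the report, D would be `as.matrix(tweets.wdtm)` from Section 8.1):

```r
# Toy 4-document, 3-term matrix as a stand-in for the TF-IDF weighted DTM
D <- matrix(c(1, 0, 2,
              0, 1, 1,
              2, 1, 0,
              1, 1, 1), nrow = 4, byrow = TRUE)

S <- D %*% t(D)        # document-document similarity, symmetric

# Convert the upper triangle of S into an edge list data frame
ut    <- which(upper.tri(S), arr.ind = TRUE)
edges <- data.frame(row    = ut[, "row"],
                    col    = ut[, "col"],
                    weight = S[ut])

# Most similar pairs first (here, the top 10 would be the whole list)
edges <- edges[order(-edges$weight), ]
head(edges, 10)
```

Indexing S with the two-column matrix `ut` pulls out exactly the upper-triangle entries, so each document pair appears once.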

List the top 10 pairs of similar tweets based on the above distances.

Find the corresponding usernames of the users involved in the top 10 pairs of similar tweets.

Use a function from the igraph library to create a graph from the edge-list data frame. Use the set_edge_attr function to set the weight of each edge to the corresponding distance between the tweets.

Find the sub-graph in which the degree of each node is greater than 2.

Plot the sub-graph. In case it consists of disconnected sub-graphs, use the decompose.graph function to plot each sub-graph.

Comment on your findings.

Note: The person wants the above analysis to be written up as a professional report. Ensure the report maintains the section numbers indicated in this document. Include only the relevant piece of code along with its output in the body of your assignment. Do not dump large chunks of output if it adds no value to your report.

