Group Project Specification
300958 Social Web Analytics
Aim
The Group Project provides us with a chance to analyse data from Twitter as a Social Web Application, using knowledge from this unit and a computer-based statistical package. For this project, we will focus on identifying a chosen Public Figure’s Twitter image.
Method
To complete this project:
Project Description
A well-known public figure is investigating their public image and has approached your team to identify how the public associates with them. They want five pieces of analysis to be performed.
Analysis of Twitter language about the Public Figure
In this section, we want to examine the language used in tweets. Use the rtweet package to download tweets.
We want to categorize (cluster) the users who tweeted about the Public Figure based on the descriptions provided in their Twitter accounts. The description in a user's Twitter profile gives a short piece of information about the Twitter handle. To cluster users, build a document term matrix from the user descriptions attached to the tweets (note: not the tweets themselves) you downloaded in Section 8.1.
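As a rough illustration of what a document term matrix of user descriptions looks like, the following minimal sketch builds one in plain Python. It is not the R workflow the project expects; the sample descriptions, the `tokenize` helper and the `document_term_matrix` function are all hypothetical, and a real analysis would normally add stop-word removal and stemming before clustering.

```python
from collections import Counter

# Hypothetical user descriptions standing in for the profile text of downloaded tweets.
descriptions = [
    "Sports fan and coffee lover",
    "Political journalist covering politics",
    "Coffee lover, sports writer",
]

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()

def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set(term for c in counts for term in c))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, dtm = document_term_matrix(descriptions)
```

Each row of `dtm` describes one user, so the rows are what a clustering method of your choice would group.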
We want to examine if the length of tweets is dependent on the user clusters you found in Section 8.2.
For the users in each cluster, find the lengths of their tweets in the downloaded tweet data.
In each cluster, find how many tweets are >= 100 characters in length and how many are shorter.
Construct a 2×M table, where M is the number of user clusters you found in Section 8.2. The two rows should hold, for each cluster, the total number of tweets with length >= 100 and the total number with length below 100. For example, if M is 3, your data structure should be as follows:
 | Cluster 1 | Cluster 2 | Cluster 3 |
Count of tweets with length >=100 | | | |
Count of tweets with length <100 | | | |
Is the length of tweets independent of user groups? Perform an appropriate test to answer this question.
Interpret your results in context.
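One candidate test for a 2×M table of counts is Pearson's chi-square test of independence. The sketch below, in plain Python rather than the statistical package the project expects, shows the arithmetic on made-up data. The `tweets` records, `length_table` and `chi_square_statistic` are hypothetical names, and the toy counts are far too small for the chi-square approximation to be trusted in practice.

```python
import math

# Hypothetical (cluster label, tweet length) pairs; real values come from the downloaded tweets.
tweets = [(1, 120), (1, 80), (1, 140), (2, 60), (2, 95), (2, 130),
          (3, 101), (3, 99), (3, 150)]

def length_table(records, n_clusters, threshold=100):
    """Build the 2 x M table: row 0 counts lengths >= threshold, row 1 the rest."""
    table = [[0] * n_clusters for _ in range(2)]
    for cluster, length in records:
        row = 0 if length >= threshold else 1
        table[row][cluster - 1] += 1
    return table

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

table = length_table(tweets, n_clusters=3)
statistic, df = chi_square_statistic(table)
# With df = 2 the chi-square upper tail has the closed form exp(-x / 2);
# other degrees of freedom need a statistical library or tables.
p_value = math.exp(-statistic / 2) if df == 2 else None
```

A small p-value would be evidence against independence of tweet length and user cluster; a large one means the data give no reason to reject it.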
Network of the Tweets
In this section, we want to examine the network of the tweets about the chosen person.
Document-document similarity can be computed by multiplying the Document Term Matrix (D) by its transpose: S = DD^T. Compute the tweet similarity matrix S from the tweets you downloaded in Section 8.1. Use TF-IDF weighting obtained from the frequencies in your document term matrix. Note that the document similarity matrix will be symmetric about the main diagonal. An illustration of S is shown below:
 | Doc1 | Doc2 | Doc3 | Doc4 |
Doc1 | Distance 11 | Distance 12 | Distance 13 | Distance 14 |
Doc2 | Distance 21 | Distance 22 | Distance 23 | Distance 24 |
Doc3 | Distance 31 | Distance 32 | Distance 33 | Distance 34 |
Doc4 | Distance 41 | Distance 42 | Distance 43 | Distance 44 |
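To make the matrix product concrete, here is a minimal sketch in plain Python of TF-IDF weighting followed by S = DD^T. The small `dtm` is invented, and the weighting used (raw count times log(N / document frequency)) is only one common TF-IDF variant and is an assumption here; the variant taught in the unit may differ.

```python
import math

# Hypothetical term counts: rows are tweets (documents), columns are terms.
dtm = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]

def tfidf(matrix):
    """Weight each count by log(N / document frequency) of its term."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[count * w for count, w in zip(row, idf)] for row in matrix]

def similarity(matrix):
    """Compute S = D D^T: entry (i, j) is the dot product of rows i and j."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
            for row_i in matrix]

weighted = tfidf(dtm)
S = similarity(weighted)
```

Because each entry is a dot product of two rows, `S[i][j]` equals `S[j][i]`, which is the symmetry noted above.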
Construct a data frame that converts either the upper triangle or the lower triangle of S into an edge list, as illustrated below. For example, Distance 12 represents the distance between documents 1 and 2.
Distance | Row Number | Column Number |
Distance 12 | 1 | 2 |
Distance 13 | 1 | 3 |
Distance 14 | 1 | 4 |
Distance 23 | 2 | 3 |
Distance 24 | 2 | 4 |
Distance 34 | 3 | 4 |
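The conversion from a symmetric matrix to an edge list can be sketched as follows in plain Python. The matrix values and the helper names `upper_triangle_edges` and `top_pairs` are hypothetical. Ranking the largest values first assumes the entries are similarity scores (as S = DD^T produces), even though the tables label them "Distance".

```python
# Hypothetical symmetric similarity matrix S for four documents.
S = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.4, 0.9],
    [0.7, 0.4, 1.0, 0.1],
    [0.0, 0.9, 0.1, 1.0],
]

def upper_triangle_edges(matrix):
    """Return (value, row number, column number) for each pair with row < column (1-based)."""
    n = len(matrix)
    return [(matrix[i][j], i + 1, j + 1) for i in range(n) for j in range(i + 1, n)]

def top_pairs(edges, k):
    """Return the k pairs with the largest values, largest first."""
    return sorted(edges, key=lambda edge: edge[0], reverse=True)[:k]

edges = upper_triangle_edges(S)
top_three = top_pairs(edges, 3)
```

Skipping the diagonal and one triangle avoids listing each document against itself and listing each pair twice.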
List the top 10 pairs of most similar tweets based on the above values (the entries of S are similarity scores, so larger values indicate more similar tweets).
Find the corresponding usernames of the users involved in the top 10 pairs of similar tweets.
Use the graph.data.frame function in the igraph library to create a graph from the edge list data frame. Use the set_edge_attr function to set the weight of each edge to the corresponding distance between the tweets.
Find the sub-graph in which the degree of each node is greater than 2.
Plot the sub-graph. In case it consists of disconnected sub-graphs, use the decompose.graph function to plot each sub-graph.
Comment on your findings.
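The degree filter and the split into disconnected pieces can be illustrated without igraph. The sketch below is plain Python on an invented edge list, not the igraph API. It assumes one reading of the task: degrees are computed once in the full graph, and only nodes with degree greater than 2 are kept. The names `degrees`, `induced_subgraph` and `components` are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected edge list as (node, node) pairs.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (6, 7)]

def degrees(edge_list):
    """Count how many edges touch each node."""
    degree = defaultdict(int)
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

def induced_subgraph(edge_list, threshold):
    """Keep edges whose endpoints both have degree > threshold in the full graph."""
    degree = degrees(edge_list)
    keep = {node for node, d in degree.items() if d > threshold}
    return [(u, v) for u, v in edge_list if u in keep and v in keep]

def components(edge_list):
    """Split an edge list into connected components, returned as sorted node lists."""
    neighbours = defaultdict(set)
    for u, v in edge_list:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(neighbours[node] - seen)
        result.append(sorted(group))
    return result

sub_edges = induced_subgraph(edges, threshold=2)
parts = components(sub_edges)
```

Note that removing low-degree nodes can lower the degree of the nodes that remain, so a stricter reading of the task would repeat the filter until no node changes.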
Note: The person wants the above analysis to be written up as a professional report. Ensure the report maintains the section numbers indicated in this document. Include only the relevant pieces of code, along with their output, in the body of your assignment. Do not dump large chunks of output that add no value to your report.