Big Data Analytics Project: Analyzing ACM Citation Network Assessment Answer


Question :

Big Data Analytics Project Spring 2020, Option A Analyzing ACM Citation Network

1. ACM Citation Dataset

ACM citation data is extracted from DBLP, ACM, and other sources. The dataset is available on this website ( ). We use ACM-Citation-network V8, which consists of 2,381,688 papers and 10,476,564 citation relationships. After downloading the dataset, extract it and look at the file “citation-acm-v8.txt”. This file is the input to your program.

Each line in “citation-acm-v8.txt” starts with a specific prefix that indicates an attribute of the paper. More specifically:

#* paper title
#@ authors
#t year
#c publication venue
#index index id of this paper
#% the id of a reference of this paper (there are multiple lines, with each indicating a reference)
#! abstract

The following is an example:

#*Information geometry of U-Boost and Bregman divergence
#@Noboru Murata,Takashi Takenouchi,Takafumi Kanamori,Shinto Eguchi
#t2004
#cNeural Computation
#index436405
#%94584
#%282290
#%6055446
…
#!We aim at an extension of AdaBoost to U-Boost, in the paradigm to build a stronger classification machine from a set of weak learning machines. A geometric understanding of the Bregman divergence defined by a generic convex function U leads to the U-Boost method in the framework of information geometry extended to the space of the finite measures over a label set. We propose two versions of U-Boost learning algorithms by taking account of whether the domain is restricted to the space of probability functions. In the sequential step, we observe….

This means that the paper titled “Information geometry of U-Boost and Bregman divergence” was written by Noboru Murata, Takashi Takenouchi, Takafumi Kanamori, and Shinto Eguchi in 2004. The paper’s index is 436405, and this paper has referenced (i.e., linked to) the papers with indices 94584, 282290, 6055446, …. The last paragraph, starting with #!, is the abstract of the paper.

Step 1. Building ACM Citation Network

The first thing you need to do is to extract the citation graph from the citation-acm-v8.txt file. You want to produce a dataset that contains records of the following form:

Paper index 1                                         paper index 2

Where paper index 1 has cited (linked to) paper index 2. This dataset is typically called a “link graph”. How can you extract the link graph from the input file?

When you use sc.textFile in Spark, the record delimiter is by default the newline character, i.e., every record is a line of input. For this project, we want a record to be all the lines regarding one paper. Since each paper starts with “#*”, we can customize the record delimiter to be “#*” instead of the newline character. That way, everything between “#*” and the next “#*” in the input file is stored in a single record.

To customize the record delimiter in spark, please add the following lines to your scala program:

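import org.apache.hadoop.conf._
import org.apache.hadoop.io._   // LongWritable and Text, used when creating the rdd
import org.apache.hadoop.mapreduce.lib.input._
// Hadoop configuration whose record delimiter is "#*" instead of the newline character
@transient val hadoopConf = new Configuration
hadoopConf.set("textinputformat.record.delimiter", "#*")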

Now you can create an rdd from the input file as follows:

val inputrdd = sc.newAPIHadoopFile(<input file>, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (key, value) => value.toString }   // keep only the record text
  .filter(value => value.length != 0)            // drop empty records

After this operation every rdd element is a chunk of data, representing a single paper starting with the paper title. For example,

Information geometry of U-Boost and Bregman divergence
#@Noboru Murata,Takashi Takenouchi,Takafumi Kanamori,Shinto Eguchi
#t2004
#cNeural Computation
#index436405
#%94584
#%282290
#%6055446
…
#!We aim at an extens
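
To see what the custom delimiter does without starting a cluster, the splitting can be imitated on a small string in plain Scala. This is only a local sketch: the object name RecordSplitSketch and the function splitRecords are hypothetical and are not part of Spark or Hadoop.

```scala
import java.util.regex.Pattern

object RecordSplitSketch {
  // Split raw file text on the literal "#*" delimiter and drop empty chunks,
  // mirroring the map/filter applied to the Hadoop records above.
  def splitRecords(raw: String): Seq[String] =
    raw.split(Pattern.quote("#*")).toSeq.filter(_.length != 0)
}
```

Each returned chunk starts with the paper title, because the delimiter itself is consumed by the split.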

Take a few samples of rdd elements to see what they look like, and then write a function to extract all the references for each paper and create an rdd in the following form:

Paper_index1 paper_index2

Where paper_index1 links to (cites) paper_index2

Note: Some papers may not have any outgoing links (that is, they may not have any references starting with #%). You can just ignore those papers and not emit anything for them.
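
One possible shape for such a function is sketched below in plain Scala, so that it can be tested locally before being used on the rdd. The names LinkExtractionSketch and extractLinks are hypothetical, and the sketch assumes each attribute sits on its own line inside a record, as in the sample above.

```scala
object LinkExtractionSketch {
  // Parse one record (all lines of one paper) and return
  // (paper_index1, paper_index2) pairs, one per "#%" reference line.
  def extractLinks(record: String): Seq[(String, String)] = {
    val lines = record.split("\n").map(_.trim)
    // The citing paper's own id comes from the "#index" line.
    val index = lines.find(_.startsWith("#index")).map(_.stripPrefix("#index").trim)
    // Every non-empty "#%" line is one outgoing link.
    val refs = lines.filter(_.startsWith("#%")).map(_.stripPrefix("#%").trim).filter(_.nonEmpty)
    index match {
      case Some(id) => refs.map(ref => (id, ref)).toSeq
      case None     => Seq.empty // no index line: emit nothing
    }
  }
}
```

Papers without references yield an empty sequence, so applying the function with flatMap on the rdd (for example, inputrdd.flatMap(LinkExtractionSketch.extractLinks)) emits nothing for them.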

Step 2. Performing Graph Analytics on ACM Citation Network

In this step, you are to analyze the citation graph as follows:

  1. Visualize the in-degree distribution of the ACM citation network.
  2. Implement a weighted page rank to find the most influential papers in the ACM citation dataset.
  3. Find the average clustering coefficient of the graph.

These requirements are explained in more detail in the following sections:

Step 2.1 Visualizing the In-Degree Distribution of the ACM Citation Network

The output of this part should be a graph which shows the in-degree distribution of the citation network. The in-degree of a paper p is the number of other papers which link to (cite) p.

The in-degree distribution P(k) of a network is the fraction of nodes in the network with in-degree k.

That is: $P(k) = n_k / n$, where $n_k$ is the number of papers with k in-links and n is the total number of nodes in the graph.

This part of the project is similar to the second problem in assignment 5. You can use the Spark GraphX library to get the total number of nodes and the number of in-links to each node, and then calculate the in-degree distribution.

As in assignment 5, the x-axis in the graph should represent k (in-degree) and the y-axis should represent P(k) (the fraction of nodes with in-degree k). Please use a logarithmic scale for both the x and y axes.
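
Before moving to GraphX, the definition of the in-degree distribution can be checked on a tiny link list with plain Scala collections. This is a local sketch only, with the hypothetical names InDegreeSketch and inDegreeDistribution; it assumes every node appears in at least one link.

```scala
object InDegreeSketch {
  // links: (citing, cited) pairs. Returns Map(k -> P(k)) with P(k) = n_k / n.
  def inDegreeDistribution(links: Seq[(Long, Long)]): Map[Int, Double] = {
    // n: every node that appears on either side of a link
    val nodes = links.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val n = nodes.size.toDouble
    // number of in-links per cited node
    val inDegree = links.groupBy(_._2).map { case (dst, in) => dst -> in.size }
    // n_k: count nodes per in-degree k (nodes never cited have k = 0)
    nodes.map(v => inDegree.getOrElse(v, 0))
      .groupBy(identity)
      .map { case (k, vs) => k -> vs.size / n }
  }
}
```

The entry for k = 0 has to be dropped before plotting, since 0 cannot be shown on a logarithmic axis.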

Note: To convert a paper id to a long integer, you can either use distinct along with zipWithIndex to obtain a unique index for each paper_id, or you can use the stringHash function in scala.util.hashing.MurmurHash3 (the older scala.util.MurmurHash class is deprecated).
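
Both options can be tried on a local collection first. The sketch below uses the hypothetical names PaperIdSketch, indexIds, and hashId; on an rdd, the first option would use the rdd's own distinct and zipWithIndex instead of the collection methods shown here.

```scala
import scala.util.hashing.MurmurHash3

object PaperIdSketch {
  // Option 1: assign each distinct id a unique Long index
  // (local analogue of distinct followed by zipWithIndex).
  def indexIds(ids: Seq[String]): Map[String, Long] =
    ids.distinct.zipWithIndex.map { case (id, i) => id -> i.toLong }.toMap

  // Option 2: hash the id string. stringHash returns an Int, widened to Long here.
  // Two different ids can collide, which option 1 avoids.
  def hashId(id: String): Long = MurmurHash3.stringHash(id).toLong
}
```

Option 1 guarantees uniqueness but needs a lookup table to translate ids; option 2 needs no table but accepts a small risk of collisions.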

Step 2.2 Implementing the Weighted Page Rank Algorithm

The page rank algorithm measures the importance of a page by the importance of the other pages that link to it. The page rank algorithm in general can be applied to any type of connected graph. For example, in social graphs, where the nodes are people and the edges are friendship relations, the page rank algorithm can be used to identify the most popular people. Or in the World Wide Web graph, where the nodes are websites and the edges are the links between the websites, the page rank algorithm can be used to identify and rank the most relevant pages.

This step is the major step of your project. In this step, you are to implement a weighted page rank algorithm on the ACM citation network dataset to identify the most influential papers. The goal is to find the top 10 papers with the highest weighted page rank in the ACM citation network. The page rank algorithm used here is based on a highly cited paper by Xing and Ghorbani [1]. As a graduate student, you should be able to read this paper, understand it, and implement its proposed algorithm. Nevertheless, I give a short summary of the weighted page rank in the following section.

You can implement the weighted page rank algorithm using Spark Core or Spark SQL.

Once you have calculated the page rank for all papers in your link dataset, find the top 10 papers with the highest page rank. Your output must be in the following form:

paper title     the number of citations (in-links)       and the page rank
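
As an illustration of this output step, the sketch below sorts already-computed scores and formats the top entries as tab-separated lines. The names TopPapersSketch, PaperScore, and topK are hypothetical, and the tab separator is an assumption; the same sort-then-take idea applies to an rdd.

```scala
object TopPapersSketch {
  // One row of the final output: title, number of in-links, page rank.
  case class PaperScore(title: String, citations: Int, rank: Double)

  // Sort by page rank, highest first, and format the top k rows.
  def topK(scores: Seq[PaperScore], k: Int = 10): Seq[String] =
    scores.sortBy(p => -p.rank).take(k)
      .map(p => s"${p.title}\t${p.citations}\t${p.rank}")
}
```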

First try your program on a small link graph and test and debug your code. I typically like to develop my program incrementally in spark-shell and take a few samples of each rdd after performing operations on it.

Once you are confident that your program produces correct page ranks for smaller data, you can run it on your UIS cluster or EMR for the full dataset. Make sure that you clean up the disk on your UIS cluster: empty your trash and remove all the old files from the local disk and HDFS. Also remove the log files in the hadoop/logs and spark/logs folders. You can run your code on Amazon Elastic MapReduce (EMR) if your UIS cluster runs out of memory or disk. If you decide to run it on EMR, you can refer to this guide to launch a Spark cluster. The EMR service is not free, but you can get some student credits by signing up on AWS Educate.

What Is Weighted PageRank and How Is It Computed

Weighted page rank is an extension to the standard page rank algorithm which takes into account the importance of in-links and out-links of a page and distributes rank scores based on the popularity of a page.

More formally, Weighted PageRank is a function that assigns a real number to each node in a graph. The intent is that the higher the PageRank of a node, the more important it is. The weighted page rank PR(u) of a node is calculated as follows [1]:

$$PR(u) = \frac{1-d}{N} + d \sum_{v \in in(u)} PR(v) \, W^{in}_{(v,u)} \, W^{out}_{(v,u)}$$

where:

  • d: is a constant (it is typically set to 0.85)
  • N: is the total number of nodes (in this case, the total number of papers in the citation network)
  • in(u): is the set of all the nodes which link to node u (in this case, the set of all papers which cited u)
  • $W^{in}_{(v,u)}$: is the in-weight of link (v, u). It is calculated as the ratio of the number of in-links (incoming links) to node u over the number of in-links to all references of node v:

$$W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in out(v)} I_p}$$

where $I_u$ and $I_p$ are the numbers of in-links of nodes u and p, and out(v) means all the reference nodes of v (all nodes to which v links).

  • $W^{out}_{(v,u)}$: is the out-weight of link (v, u). It is calculated as the ratio of the number of out-links (outgoing links) from node u over the number of out-links from all references of node v:

$$W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in out(v)} O_p}$$

where $O_u$ and $O_p$ are the numbers of out-links of nodes u and p, and out(v) means all the reference nodes of v (all nodes to which v links).

At the initial point, each node is initialized with PageRank 1/N, so the sum of the PageRanks is 1. For this project, you only need to do 10 iterations and compute the page rank of each paper after the 10th iteration. Please use a loop to do the iterations.


Consider a small citation network consisting of three papers A, B, and C, where paper A cites papers B and C, paper B cites paper C, and paper C cites paper A. The following figure illustrates the link graph of this simple problem.
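
For this three-paper graph, the computation can be sketched in plain Scala before it is ported to Spark. The sketch assumes the update rule PR(u) = (1 - d)/N + d * (sum over v in in(u) of PR(v) * W_in(v,u) * W_out(v,u)), a link list without duplicates, and an out-weight of 0 when none of the references of v has any out-links (a case the description above does not cover). The names WeightedPageRankSketch, linkWeights, and pageRank are hypothetical.

```scala
object WeightedPageRankSketch {
  type Edge = (String, String) // (citing paper v, cited paper u)

  // Combined weight W_in(v,u) * W_out(v,u) for every link (v, u).
  def linkWeights(edges: Seq[Edge]): Map[Edge, Double] = {
    val inDeg = edges.groupBy(_._2).map { case (u, es) => u -> es.size.toDouble }.withDefaultValue(0.0)
    val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size.toDouble }.withDefaultValue(0.0)
    val refs = edges.groupBy(_._1).map { case (v, es) => v -> es.map(_._2) } // out(v)
    edges.map { case (v, u) =>
      val inSum = refs(v).map(inDeg).sum   // in-links of all references of v
      val outSum = refs(v).map(outDeg).sum // out-links of all references of v
      val wIn = if (inSum == 0) 0.0 else inDeg(u) / inSum
      val wOut = if (outSum == 0) 0.0 else outDeg(u) / outSum // assumption: 0 when undefined
      (v, u) -> wIn * wOut
    }.toMap
  }

  // Start from 1/N and apply the update rule in a loop.
  def pageRank(edges: Seq[Edge], iterations: Int = 10, d: Double = 0.85): Map[String, Double] = {
    val nodes = edges.flatMap { case (v, u) => Seq(v, u) }.distinct
    val n = nodes.size.toDouble
    val weights = linkWeights(edges)
    var ranks = nodes.map(v => v -> 1.0 / n).toMap
    for (_ <- 1 to iterations) {
      // contribution PR(v) * weight sent along each link, summed per cited node
      val contribs = weights.toSeq
        .map { case ((v, u), w) => u -> ranks(v) * w }
        .groupBy(_._1)
        .map { case (u, cs) => u -> cs.map(_._2).sum }
      ranks = nodes.map(u => u -> ((1 - d) / n + d * contribs.getOrElse(u, 0.0))).toMap
    }
    ranks
  }
}
```

With the links A→B, A→C, B→C, and C→A, the combined weights are 1/6 for (A, B), 1/3 for (A, C), and 1 for both (B, C) and (C, A), which can be checked by hand against the weight definitions. Calling WeightedPageRankSketch.pageRank(Seq(("A","B"), ("A","C"), ("B","C"), ("C","A"))) then runs the 10 iterations.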

