Chair Network: Analyzing a Wikipedia references network
João Pedro da S. Lima (UFRN), Pedro Barbalho P. da Silva (UFRN)
Introduction
Wikipedia is, without doubt, one of the most important websites in the world. As this article is being written, the site has a total of ~55 million articles in ~300 different languages.
Each of these articles covers a single subject and represents a small piece of human knowledge.
Wikipedia becomes even more interesting from a Network Analysis perspective. Since almost every article links to other Wikipedia articles, we can analyse it as a Directed Graph, where an edge from node A to node B exists if article A links to article B on its page.
This analysis can help us visualize the most “important” articles according to a given metric, the most cited articles, and the main groups that compose the network, giving useful information to areas like Information Retrieval (IR), Data Analysis and Clustering.
In this article, we are going to focus on the analysis of a network built from the links in Wikipedia’s Chair article.
Building the network
As mentioned before, Wikipedia articles can be represented as a Directed Graph, where an edge from node A to node B exists if article A links to article B on its page.
With this concept in mind, we’re going to build a network following these steps:
- Determine a SEED, a main page from which we start collecting our links.
- Determine a LAYER, an integer value that represents how far we are allowed to go from our SEED. A LAYER of 1 means that we are allowed to include only the neighbours of SEED, a LAYER of 2 means that we can also include the neighbours’ neighbours, and so on.
To do this, we’re going to use Python with the libraries NetworkX and wikipedia.
The chair
One of the basic pieces of furniture, a chair, is a type of seat. Its primary features are two pieces of a durable material, attached as back and seat to one another at a 90° or slightly greater angle… Chair, Wikipedia
The first step is simple: we just need to choose a Wikipedia page.
We chose the article about one of the most important objects in all of human history: the chair.
The chair is a perfect example of how a simple object can be adapted across time, space, culture and purpose. From the perspective of building a network, we think that the chair article can lead us to various aspects of humankind’s history (through its own story), culture (e.g., as a piece of decoration) and science (in the study of ergonomics and materials).
Building Implementation
To build the network, we’re going to use breadth-first search (BFS), a graph traversal algorithm.
The code is based on this notebook.
BFS works on the ‘layers’ around a graph’s entry node: the first layer is the entry’s neighbours, the second is the neighbours’ neighbours, and so on. The algorithm is called breadth-first because it consumes the nodes layer by layer.
First of all, we need to import the libraries and determine the SEED and initial conditions.
The BFS implementation is quite simple, and can be seen below:
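As a rough illustration of the layer-limited BFS described above, here is a minimal sketch. It is not necessarily the exact code we ran: the function name `crawl`, the `get_links` callback, the value of `MAX_LAYER` and the toy link table are all assumptions made so the example runs without network access.

```python
from collections import deque

SEED = "Chair"   # starting article (from the text)
MAX_LAYER = 2    # assumed depth limit; the value actually used is not stated


def crawl(seed, max_layer, get_links):
    """Breadth-first crawl returning directed edges (source, target).

    get_links(title) must return an iterable of titles linked from `title`.
    """
    edges = set()
    layer_of = {seed: 0}          # article -> layer where it was first seen
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if layer_of[page] >= max_layer:
            continue              # pages on the last layer are not expanded
        for target in get_links(page):
            edges.add((page, target))
            if target not in layer_of:
                layer_of[target] = layer_of[page] + 1
                queue.append(target)
    return edges, layer_of


# Toy link table standing in for Wikipedia (hypothetical data).
TOY_LINKS = {
    "Chair": ["Furniture", "Wood"],
    "Furniture": ["Chair", "Table"],
    "Wood": ["Tree"],
    "Tree": ["Forest"],
}

edges, layer_of = crawl(SEED, MAX_LAYER, lambda t: TOY_LINKS.get(t, []))
print(sorted(edges))
print(layer_of)
```

With the wikipedia package, `get_links` could be something like `lambda title: wikipedia.page(title).links` (with error handling for missing or ambiguous pages), and the resulting edges can be loaded into a NetworkX `DiGraph` with `add_edges_from`. Note that pages on the last layer are added but never expanded, so they have no outgoing links.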
Running this code, you’ll see the following output:
The full process takes a few minutes and, in the end, the graph g contains our network.
## Visualizing graph info
print(f"{len(g)} nodes, {nx.number_of_edges(g)} edges")
> 33190 nodes, 43511 edges
Cleaning the network
Before finishing the build and moving on to the analysis, we need to clean misleading data out of our network.
As the purpose of the network is to adequately represent the relation of a main page or main subject with its surrounding pages, we need to ensure that each node in our graph is representative.
By creating the network with the BFS method (discussed earlier), we get a lot of non-representative nodes, i.e., nodes with only one neighbour and no outgoing links. The figure below shows that almost 84% of our network is composed of these nodes.
So, let’s remove them.
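The removal step can be sketched as follows. This is a minimal example on a toy edge set, assuming that "non-representative" means exactly one neighbour and no outgoing link, and that the filter is applied in a single pass; `prune_dead_ends` is a hypothetical helper name.

```python
def prune_dead_ends(edges):
    """Drop nodes that have exactly one neighbour and no outgoing link."""
    out_links, neighbours = {}, {}
    for src, dst in edges:
        out_links.setdefault(src, set()).add(dst)
        out_links.setdefault(dst, set())
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    dead = {n for n in neighbours
            if len(neighbours[n]) == 1 and not out_links[n]}
    kept = {(s, d) for s, d in edges if s not in dead and d not in dead}
    return kept, dead


# Toy edges (hypothetical data): Tree and Table are dead ends.
toy_edges = {
    ("Chair", "Furniture"), ("Furniture", "Chair"),
    ("Chair", "Wood"), ("Wood", "Tree"), ("Furniture", "Table"),
}
kept, dead = prune_dead_ends(toy_edges)
removed_edges_pct = 100 * (len(toy_edges) - len(kept)) / len(toy_edges)
print(sorted(dead), f"{removed_edges_pct:.2f}% of edges removed")
```

On a NetworkX graph, the same set of nodes could be dropped with `g.remove_nodes_from(dead)`. A single pass can leave new dead ends behind (Wood in the toy data), so whether to repeat the filter is a design choice.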
After this cleaning we end up with the following stats:
Nodes removed: 83.07%
Edges removed: 63.19%
Data Analysis
To analyze the data, we’re going to use Gephi, which is a widely used graph visualization software. It makes it much easier to find good visualizations for networks.
The first step is to give our graph a nicer appearance. This is done by applying a few of Gephi’s built-in layouts. The main layout used was OpenOrd, but a few adjustments were made with other layouts.
The next step is to calculate important statistics about our network in the Statistics menu; these will be used later to support the visual analysis of the network.
We calculated the following statistics:
- Modularity
- In & Out Degree
- Betweenness Centrality
With these scores, we can go to the analysis itself.
Visualizing clusters with Modularity Class
From the previous image, we can see that the network naturally shapes itself into a few ‘groups’. In data analysis, it is often important to visualize the clusters that compose the data. Fortunately, Gephi implements the Modularity Class in the Statistics menu, which automatically splits the dataset into groups using the Modularity value calculated previously.
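To give an idea of what the Modularity value measures, here is a minimal sketch that scores a given partition with Newman's modularity Q. It assumes the edges are treated as undirected and that the partition is already known; how Gephi searches for the best partition is not covered here, and the function name and toy graph are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition, treating edges as undirected.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    edges = list(edges)
    m = len(edges)
    inside = {}   # community -> number of edges with both ends inside it
    degree = {}   # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)


# Two triangles joined by a single edge (hypothetical data).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in two_groups}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than putting every node in one group, which is the kind of signal a modularity-based clustering uses to pick its groups.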
To visualize these groups, we paint each one with a unique color. We also resize the nodes based on their In Degree.
We can see that some groups correspond to densely connected subnetworks, a few of which contain a core article. To ease our understanding, let’s display the names of the nodes with degree > 450.
This simple visualization can already explain most of the network structure. The bigger nodes in each group give us a big hint about the overall subject of these groups, e.g. the brown group with Barack Obama probably has articles about USA politics, the orange group with Wood probably covers materials, and so on.
Visualizing most important nodes
It’s time to answer one of the main questions in Network Analysis: What are the main nodes in our network?
That might sound like a simple question, but it’s hard to find a consensus on what ‘importance’ actually means. To answer this question, we’re going to analyse our nodes from two perspectives: Degree Centrality and Betweenness Centrality.
Visualizing most important nodes according to Degree Centrality
Degree centrality is the metric that counts how many connections a node has. In our context, an article with high degree centrality will be either a heavily cited page, or a page that cites a lot of other articles.
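A minimal sketch of this count on a directed edge list is shown below. The helper name `degrees` and the toy edges are hypothetical; in-degree counts citations received and out-degree counts citations made.

```python
def degrees(edges):
    """Return {node: (in_degree, out_degree)} for a directed edge list."""
    in_deg, out_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    nodes = set(in_deg) | set(out_deg)
    return {n: (in_deg.get(n, 0), out_deg.get(n, 0)) for n in nodes}


# Toy edges (hypothetical data).
toy_edges = [("Chair", "Paris"), ("Wood", "Paris"),
             ("Paris", "France"), ("Chair", "Wood")]
table = degrees(toy_edges)
# Node with the largest total degree (in + out).
top = max(table, key=lambda n: sum(table[n]))
print(top, table[top])
```

On a NetworkX `DiGraph`, the same counts are available through `g.in_degree(n)` and `g.out_degree(n)`, and `nx.degree_centrality(g)` returns the total degree divided by the number of other nodes.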
The following image shows the nodes resized by the Degree Centrality measure. We also turned off the edges to clean the image up.
By analysing the network by this criterion, we come to the conclusion that Paris is the most influential node. That means Paris is the node with the most direct connections, counting both incoming and outgoing links. As can be seen in the heatmap below, Paris has direct connections with a large percentage of the other nodes.
Visualizing most important nodes according to Betweenness Centrality
Betweenness Centrality measures how much a given node participates in the shortest paths between all other nodes.
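To make the definition concrete, here is a minimal sketch of unnormalised betweenness for a directed, unweighted graph, following Brandes' algorithm. It assumes each edge appears only once; the function name and the toy graph are hypothetical, and Gephi's own implementation may normalise the scores differently.

```python
from collections import deque


def betweenness(nodes, edges):
    """Unnormalised betweenness centrality of a directed, unweighted graph."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # BFS from s, counting the shortest paths (sigma) to every node.
        dist = {s: 0}
        sigma = {n: 0 for n in nodes}
        sigma[s] = 1
        preds = {n: [] for n in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in succ[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Toy graph (hypothetical data): every path from Paris goes through Chair.
toy_nodes = ["Paris", "Chair", "Wood", "Furniture"]
toy_edges = [("Paris", "Chair"), ("Chair", "Wood"), ("Chair", "Furniture")]
scores = betweenness(toy_nodes, toy_edges)
print(scores)
```

In the toy graph, Chair lies on the shortest paths from Paris to both Wood and Furniture, so it is the only node with a non-zero score. With NetworkX, `nx.betweenness_centrality(g)` computes the same kind of measure.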
The following image shows the nodes resized by the Betweenness Centrality measure. We also turned off the edges to clean the image up.
With no surprises, Chair is the most important node. In fact, the way the network was created induces this result: as the origin of the crawl, Chair dictates the whole graph’s flow of information.
Exploring the Network by yourself
If you want to explore this network by yourself, you can find the files in this GitHub Repository or visit this link, where we host a webpage with a network visualization created using Gephi’s Sigma Exporter plugin.
Conclusion
In this article we showed the process of building a network (Directed Graph) by scraping Wikipedia citations from a source (SEED) article. The choice of Chair as our SEED led us to a very interesting network, with a lot of interesting groups and node distributions.
With the network in hand and Gephi’s help, we were able to make insightful analyses of the data distribution by looking at its clusters. We were also able to answer the question: What are the main nodes in our network? In fact, it’s hard to decide what ‘importance’ is, and we opted to answer this question from two different perspectives, considering the Degree and Betweenness Centralities.
In summary, we studied the paths a simple object could lead us down.
Thank you for reading.