While reading a paper for class, I felt compelled to try my hand at implementing the approach they took. A lot of times I read things, they make some sense, but I don’t really know how much I don’t know about them until I stop reading and try doing. The paper is called Finding and Evaluating Community structure in networks, from 2003.
I read the paper a couple months ago, and the other day started thinking about all the uses for a means of picking out communities within a larger network. Basically, their paper says we should calculate the betweenness of each edge in a graph, and then remove those edges with the highest betweenness. If betweenness is a measure of how often an edge is crossed on a path for every pair of nodes in the graph, then we’ll be removing the edges that are most commonly crossed on a shortest path from node a to b. Eventually, the original graph is split up into smaller graphs, which, from their perspective, carry greater similarity between nodes.
So thinking about this in terms of a real community, I figured, two subreddits, a and b, are connected if a user has two comments c1 and c2 that live in a and b. So this constitutes an edge in the graph, where a node is a subreddit. I used the python reddit wrapper to pull some submissions and comments down, where I then constructed a graph. I figured it would be neat to evaluate this in large sets of data, so I committed the graph to a redis instance. When the data is downloaded, a python script loads the entire data set from redis and begins classifying (the fun part!) communities. The following occurs:
- Calculate every shortest path for every pair of nodes in the graph
- For each node in the graph, find the fraction of paths that contain that node vs how many don’t. This is betweenness (as per wikipedia’s definition)
- Remove the edge with the highest betweenness (just the betweenness of that start and end nodes added together…maybe this assumption is flawed)
So then I can draw it all in arbor.js
You can view one of the results here. I’m running the classifier on a much larger data set. It’s slow, because of the recalculation step of betweenness at each node removed, so maybe a method like this wouldn’t work as well in production where either you have speed requirements or massive data sets. (or, you know, just profile the code that I left un-profiled). The visualization is a physicsy thing, so if you wait a bit and let it settle, the communities will begin to repel each other, so you can see things better.
tl;dr: graphs and stuff. method of classifying subreddits based on users’ behavior.
Built with python and arbor.js