I'm Chris.
April 16th, 2013
March 12th, 2013
March 12th, 2013
Built with python
February 20th, 2013
Built with django, backbone.js, and bootstrap
February 2nd, 2013
Built with python
January 13th, 2013
Built with python and sigma.js
September 12th, 2012
While reading a paper for class, I felt compelled to try my hand at implementing the approach they took. A lot of times I read things, they make some sense, but I don’t really know how much I don’t know about them until I stop reading and try doing. The paper is called Finding and Evaluating Community structure in networks, from 2003.
I read the paper a couple months ago, and the other day started thinking about all the uses for a means of picking out communities within a larger network. Basically, their paper says we should calculate the betweenness of each edge in a graph, and then remove those edges with the highest betweenness. If betweenness is a measure of how often an edge is crossed on a path for every pair of nodes in the graph, then we’ll be removing the edges that are most commonly crossed on a shortest path from node a to b. Eventually, the original graph is split up into smaller graphs, which, from their perspective, carry greater similarity between nodes.
So thinking about this in terms of a real community, I figured, two subreddits, a and b, are connected if a user has two comments c1 and c2 that live in a and b. So this constitutes an edge in the graph, where a node is a subreddit. I used the python reddit wrapper to pull some submissions and comments down, where I then constructed a graph. I figured it would be neat to evaluate this in large sets of data, so I committed the graph to a redis instance. When the data is downloaded, a python script loads the entire data set from redis and begins classifying (the fun part!) communities. The following occurs:
So then I can draw it all in arbor.js
You can view one of the results here. I’m running the classifier on a much larger data set. It’s slow, because of the recalculation step of betweenness at each node removed, so maybe a method like this wouldn’t work as well in production where either you have speed requirements or massive data sets. (or, you know, just profile the code that I left un-profiled). The visualization is a physicsy thing, so if you wait a bit and let it settle, the communities will begin to repel each other, so you can see things better.
tl;dr: graphs and stuff. method of classifying subreddits based on users’ behavior.
Built with python and arbor.js