Reddit describes itself as the “front page of the internet”, and given how many users it has, that’s not too far off. It’s divided into subreddits, which can have either broad or narrow topics. These subreddits are (mostly) user-created, with the admins only occasionally stepping in to remove them. Thus, subreddits represent an “organic” set of topics on social media.
There have been a few subreddit maps created before, like Vizit, which was based on cross-posts. Here I’m interested in measuring overlap of users; that is, how many users different subreddits have in common. (Correction: I originally thought redditviz was based on crossposts, but it’s not, it’s based on users, so check that out for a larger version of the same idea.) This presented some practical difficulties, because scraping comments is a lot more demanding than scraping posts. I started with comments for 2,000 subreddits. After removing low-weight edges to cut noise, and removing isolated subreddits, I ended up with about 900.
The full map can be viewed here
The networks (pre- and post- filtering) are available here.
I downloaded the list of top 2,000 subreddits, by subscriber number, from redditmetrics.com. Based on that list, I requested the top 500 links from the previous month for each subreddit, on Oct 11, 2014. Reddit has an API limit of 1 request / 2 seconds, which ended up squashing some of my ambitions. I waited a week for comments to happen, and then started downloading the comments for each link. I used the praw library to download this info, and store it in a MongoDB database.
The first 200 comments for each link can be fetched in one request; after that, one has to request “more”. I set a limit of 100 additional calls, each of which can return at most 10 comments, so that’s a maximum of 1200 comments per link. Each link reports the number of comments it has, so I have those statistics as well. We only have text and comment info for a maximum of 1200 comments per thread. For most threads in most subreddits that’s ample, but not for the larger ones.
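The cap is just arithmetic on the limits quoted above. The sketch below spells it out; the constant names are illustrative labels, and the worst-case time assumes every request waits the full 2 seconds imposed by the API limit.

```python
# Limits quoted in the text; the constant names are illustrative labels.
FIRST_REQUEST_COMMENTS = 200   # comments returned by the initial request
MAX_MORE_CALLS = 100           # cap on additional "more" requests per link
COMMENTS_PER_MORE_CALL = 10    # maximum comments returned by one "more" call
SECONDS_PER_REQUEST = 2        # API limit: 1 request every 2 seconds

max_comments = FIRST_REQUEST_COMMENTS + MAX_MORE_CALLS * COMMENTS_PER_MORE_CALL
max_requests = 1 + MAX_MORE_CALLS
worst_case_seconds = max_requests * SECONDS_PER_REQUEST

print(max_comments)        # 1200
print(worst_case_seconds)  # 202
```

So a single very busy thread can cost over three minutes of request budget, which is why the extra calls had to be capped.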
Once the comments were downloaded, I started counting. For each thread I identified the unique users (a user was only counted once per thread), and for each subreddit counted the number of threads each user appeared in. I boiled down the list of users for a subreddit to those users who appeared in 2 or more threads, to mitigate outliers.
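A minimal sketch of this counting step, not the code actually used for the map. It assumes each thread is available as a plain list of comment author names; the function name `regular_users` and the sample data are made up for illustration.

```python
from collections import Counter


def regular_users(threads, min_threads=2):
    """Return the users who commented in at least `min_threads` threads.

    `threads` is a list of threads from one subreddit, each given as a
    list of comment author names (one entry per comment).
    """
    thread_counts = Counter()
    for authors in threads:
        # set() so a user counts once per thread, however often they comment
        thread_counts.update(set(authors))
    return {user for user, n in thread_counts.items() if n >= min_threads}


# Made-up example data for one subreddit
threads = [
    ["alice", "bob", "alice"],
    ["alice", "carol"],
    ["bob", "dave", "bob"],
]
print(sorted(regular_users(threads)))  # ['alice', 'bob']
```

Here carol and dave each show up in only one thread, so they are dropped as possible drive-by commenters.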
I determined the “connection” between subreddits by Jaccard coefficient of the users: the intersection of the user set divided by the union.
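As a small illustration of the Jaccard coefficient on user sets (the user names are made up, and returning 0.0 for two empty sets is a convention chosen here, not something the original analysis specifies):

```python
def jaccard(users_a, users_b):
    """Jaccard coefficient of two user sets: |A ∩ B| / |A ∪ B|."""
    union = users_a | users_b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(users_a & users_b) / len(union)


a = {"alice", "bob", "carol", "dave"}
b = {"carol", "dave", "erin"}
print(jaccard(a, b))  # 0.4
```

Two users are shared out of five distinct users overall, giving 2 / 5 = 0.4. The measure is symmetric and always falls between 0 (no shared users) and 1 (identical user sets).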
I used an arbitrary cutoff of 0.01 and set any connection below that value to 0. I also required an absolute overlap of at least 10 users, as the cutoff of 0.01 amounts to only 2-3 users for some niche subreddits.
As with previous subreddit maps, I removed the major subreddits. In my case, that was any subreddit with more than a million subscribers. To create the network, I used networkx, starting with a spring layout. I exported that network to gexf, imported it into Gephi, and did much more manual layout work.
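The two filters can be sketched together as below. This is an illustration under assumptions, not the original code: the function name `edge_weight` and the synthetic user ids are invented, and the defaults are the 0.01 cutoff and the 10-user minimum described above.

```python
def edge_weight(users_a, users_b, min_jaccard=0.01, min_overlap=10):
    """Jaccard weight between two subreddits, or 0.0 if the edge is too weak."""
    shared = len(users_a & users_b)
    union = len(users_a | users_b)
    if union == 0 or shared < min_overlap:
        return 0.0  # too few users in common, whatever the ratio says
    weight = shared / union
    return weight if weight >= min_jaccard else 0.0


# Synthetic user ids: 250 distinct users in total in both cases.
niche_a = {f"u{i}" for i in range(150)}
niche_b = {f"u{i}" for i in range(147, 250)}   # 3 users shared with niche_a
linked_b = {f"u{i}" for i in range(130, 250)}  # 20 users shared with niche_a

print(edge_weight(niche_a, niche_b))   # 0.0
print(edge_weight(niche_a, linked_b))  # 0.08
```

The first pair has a Jaccard coefficient of 3 / 250 = 0.012, which clears the ratio cutoff but rests on only three users, so the absolute minimum removes it.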
To detect/create the clusters, I used the Modularity metric. This metric defines a community based on the number of ingroup connections versus outgroup connections. The nodes are colored by their modularity class. I manually examined these groups and came up with a name for each which I thought was descriptive; some are more obvious than others.
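To make the ingroup-versus-outgroup idea concrete, here is a sketch that scores a given partition with the standard modularity formula, Q = sum over communities of (internal edges / m) - (total degree / 2m)^2, where m is the number of edges. It only scores a partition; the method cited in the footnotes searches for a partition that makes this score large. It also treats edges as unweighted for simplicity, which is an assumption (the map's edges carry Jaccard weights), and the toy graph is made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs; `community_of` maps node -> label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(round(modularity(edges, two_groups), 3))  # 0.357
print(modularity(edges, one_group))             # 0.0
```

Splitting the graph at the bridge scores higher than lumping everything together, because most edges then fall inside a community rather than between communities.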
Visualization was done using the sigma.js export function from Gephi, with some customization on my part. This plugin was developed by the Oxford Internet Institute.
Results and Discussion
The modularity algorithm identified 28 communities, ranging in size from 3 to 95. The average degree of the nodes is 10, and the median degree is 2. This distribution isn’t a power law, as it is for many social network graphs, but it is skewed. Note that this distribution is after all my cutting/trimming/etc. A histogram of the connections per subreddit, after trimming, is below.
One surprising result was that the number of comments per thread had very little correlation to the number of subscribers. Oh sure, the really tiny subreddits have few comments, and the really big ones have a lot, but in the middle it’s all over the place. Evidently the topic matters more than the sheer number of people.
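A mean degree well above the median is the signature of a few highly connected nodes alongside many sparsely connected ones. The sketch below computes both statistics from an edge list; the function name and the toy hub-and-leaves graph are made up and are not the map's data.

```python
from collections import Counter
from statistics import mean, median


def degree_summary(edges):
    """Return (mean degree, median degree) over nodes that have an edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    values = list(degree.values())
    return mean(values), median(values)


# Toy graph: one hub connected to eight leaves.
edges = [("hub", f"leaf{i}") for i in range(8)]
avg, med = degree_summary(edges)
print(avg > med)  # True
print(med)        # 1
```

Only nodes that appear in at least one edge are counted, which matches a network from which isolated subreddits have already been removed.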
The subreddits are sized by the number of unique commenters, not by the number of comments. So it’s more directly a measure of the size of the userbase. One might also be interested in how active a subreddit is, that is, the number of comments per thread. The two quantities are highly correlated (see below) for any reasonably large subreddit. There’s no way to measure lurkers though.
First things first: porn. There are 2 distinct porn communities, which at first I dismissed as a clustering artifact but on closer examination I believe the distinction is meaningful. I named them “Porn” and “GoneWild”. The “Porn” subreddits are for pornographic images taken from anywhere. There are a bunch, divided into different interests. This is where one would post content from their favorite porn site.
The GoneWild cluster is centered around redditors posting pornographic pictures of themselves. The usual interests are represented (asians, nerdy, BBW), largely with “gonewild” or “gw” in the name. These two clusters are much closer to each other than to the rest of reddit, so there is clearly some overlap, but I thought the distinction was interesting. Also /r/dykesgonewild is in this cluster, and not the QUILTBAG one.
There are a few subs that are apparently on the border between Gender_Relationships and Making_Fun: /r/pettyrevenge, /r/prorevenge, and /r/TalesFromRetail. Some more emotionally charged subs are here too, like /r/confession, /r/offmychest, and /r/raisedbynarcissists. That is why I called this cluster “Gender_Relationships”: it seems to be about any issue or any kind of relationship that is in any way gendered.
I wonder if there are 2 subclusters in the MakingFun cluster, because there are definite political divides. /r/TumblrInAction is for making fun of tumblr, and has a definite anti-social-justice bent. It does not directly connect to /r/ShitRedditSays, which has a pro-social-justice bent. TiA links to the fat-hate triple, while SRS does not. They both have /r/SubredditDrama, /r/lewronggeneration, and /r/forwardsfromgrandma in common, but that’s about it. The shitlords on TiA have a wider reach, judging by the number of connected subreddits, than the SRS fempire. Yay patriarchy!
I could go on and on about interesting connections. What do guns and fountain-pens have in common? Answer: You carry them both around every day. See /r/EDC and its neighbors. /r/girlgamers is halfway between Gender_Relationships and ComputerGames; /r/Seattle likes bikes; and Brits like soccer (although I’m guessing an American started /r/soccer, given the name). Most connections I’m seeing make perfect sense in retrospect, but I wouldn’t necessarily have guessed them ahead of time. That suggests subreddits can be used for topic-association. And I haven’t even touched the actual words in comments!
Reddit is a rich dataset for datamining, given its division into subreddits and clusters thereof. Feel free to poke around the map, point out anything interesting you see, and/or give me feedback on the methods and visualization. I’ve spent hours staring at the pretty colors; hopefully it’s as informative as it is beautiful.
- Vizit. http://redditstuff.github.io/sna/vizit/
- Cross-posts: where the same link is posted to multiple subreddits.
- Redditviz. http://arxiv.org/abs/1312.3387 http://rhiever.github.io/redditviz/
- PRAW: The Python Reddit API Wrapper. https://praw.readthedocs.org/en/v2.1.20/index.html
- Fast unfolding of communities in large networks. http://arxiv.org/abs/0803.0476