The Pareto Principle for businesses states that 80% of sales come from 20% of customers. Social media has the same skew; the majority of content comes from a minority of users. I’ve always been curious just how skewed this activity can be, and in particular how much the skew varies from one forum to another. Reddit provides a natural opportunity to measure this: there are an enormous number of subreddits with varying levels of activity. Subreddits also provide straightforward topic association, so we can see which topics tend to be dominated by fewer individuals.
Dataset
I used the corpus of reddit comments collected by /u/Stuck_In_the_Matrix[1]. To keep things manageable I only used one month of data, January 2015. This corpus includes 53,851,542 total comments from 2,512,122 unique usernames, and (allegedly) represents every public comment on reddit during that time. For simplicity I’ll use the terms username/user/author interchangeably, although strictly speaking a user may have many usernames. I excluded comments which were deleted, or which were made by any of about a dozen highly active bots (AutoModerator, autowikibot, etc.). Only subreddits with at least 1000 users were included.
Totals: 1,322 subreddits. 2,512,103 users. 42,033,578 comments. Woot!
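The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this post. It assumes the corpus is stored as one JSON object per line with "author" and "subreddit" fields, and that deleted comments carry the author "[deleted]". The bot list contains only the two bots named above, and comment_counts is a hypothetical helper name.

```python
import json
from collections import Counter, defaultdict

# Assumptions: field names "author"/"subreddit", the "[deleted]" marker,
# and a bot list limited to the two bots named in the post.
EXCLUDED_AUTHORS = {"[deleted]", "AutoModerator", "autowikibot"}
MIN_USERS = 1000


def comment_counts(lines, min_users=MIN_USERS, excluded=EXCLUDED_AUTHORS):
    """Return {subreddit: Counter({author: n_comments})}.

    Comments by excluded authors are skipped, and subreddits with fewer
    than `min_users` distinct remaining authors are dropped.
    """
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        comment = json.loads(line)
        author = comment["author"]
        if author in excluded:
            continue
        counts[comment["subreddit"]][author] += 1
    # Keep only subreddits with enough distinct commenters.
    return {sub: c for sub, c in counts.items() if len(c) >= min_users}
```

Any iterable of lines works as input, so an open file handle for the monthly dump can be passed in directly and is read one line at a time rather than loaded into memory.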
The metric I’ll be presenting here is the Gini coefficient. It was developed for measuring wealth inequality, but can be applied to any frequency distribution. It takes a value of 0 for a perfectly equal distribution, and 1 for a completely unequal one (1 person responsible for all the wealth/comments, none by anybody else). The full processed statistics are available in a fusion table here, including some other statistics not discussed in this post.
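For anyone who wants to reproduce the metric, here is one standard way to compute a Gini coefficient from a list of per-user comment counts. It is a sketch rather than the exact code behind these numbers. It uses the sorted-rank formula G = 2·Σ(i·x_i) / (n·Σx_i) − (n+1)/n, where the x_i are sorted in ascending order and i runs from 1 to n. I am assuming that only users with at least one comment in a subreddit are counted.

```python
def gini(values):
    """Gini coefficient of a sequence of non-negative numbers.

    Uses the sorted-rank formula:
        G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    with x_i sorted ascending and i running from 1 to n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Four users with one comment each: perfectly equal.
print(gini([1, 1, 1, 1]))    # 0.0
# One user wrote everything; the maximum for n users is (n - 1) / n.
print(gini([0, 0, 0, 10]))   # 0.75
```

Note that with a finite number of users the coefficient tops out at (n − 1)/n rather than exactly 1, which is negligible for subreddits with 1000 or more users.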
Results
The distribution of Gini coefficients across subreddits is shown below. The average value is 0.59, with most subreddits falling between 0.4 and 0.8.
I was a little surprised by this plot, mainly by the spread. An average Gini of 0.59 seems reasonable, though that’s still an intense skew. For reference, the Gini coefficient of US income is 0.47. The spread is very wide, though: some subreddits are highly concentrated in a few commenters, and some are much more egalitarian.
We can also look at the most and least egalitarian subreddits. Here are the top and bottom 50 subreddits by Gini index:
Speculation/Discussion
The low-Gini category seems to be mostly pictures: /r/gif, /r/AnimalsBeingBros, /r/AnimalsBeingJerks, /r/HumanPorn (for human portraits). /r/millionairemakers is an interesting sub; a random winner is chosen and subscribers are asked to donate $1, with the hope of making one person a millionaire. They haven’t made any millionaires but they’ve made some people a few thousand. Among the other high-Gini subs we see sports-related subs (/r/nfl, /r/BostonBruins) and some other entertainment subs. /r/RWBY covers an anime web series, and /r/MLPLounge and /r/mylittlepony are both present. Sidenote: This might be the first time I’ve seen /r/Coontown (racist) and /r/GamerGhazi (anti-gamergate / pro-social-justice) so close to each other[2].
Putting these together, it seems like more casual subreddits have the lowest Gini. Nobody is super-passionate about /r/Eyebleach; you just go to look at pretty pictures. The high-Gini subs have topics which people get a bit more passionate about: their favorite show, sports team, video game, etc. There are exceptions; /r/arresteddevelopment is a low-Gini subreddit, for instance. A small core of extremely passionate individuals is what makes a high-Gini environment. I’m sure many users on /r/cigars just want a few tips, but I’m equally sure some people are mega-serious about getting the best damn cigar they can.
Caveats
Since this is a complete dataset there shouldn’t be much in the way of selection biases. There were 3 million deleted comments which represent ~6% of the total, not a huge amount. Also, there is no way to link usernames to people, so interpretation gets a little bit complicated. Without knowing how many alternate accounts exist, or the commenting behavior on them, it’s hard to know how inequality in comments-per-username translates into inequality in comments-per-real-person.
Also, I excluded the most prolific bots, but there are likely some I missed. A prolific bot targeted at a specific subreddit will have very high activity and could cause an artificially high Gini index.
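To get a feel for how much a single missed bot could matter, here is a toy calculation with made-up numbers (not taken from the dataset): 1000 users with 5 comments each, plus one hypothetical bot that posts 5000 comments. The gini helper is the standard sorted-rank formula, included so the snippet runs on its own.

```python
def gini(values):
    """Gini coefficient via the sorted-rank formula (values non-negative)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Made-up subreddit: 1000 users who each commented 5 times.
humans = [5] * 1000
# The same subreddit with one hypothetical bot that posted 5000 comments.
with_bot = humans + [5000]

print(round(gini(humans), 3))    # perfectly equal, so 0.0
print(round(gini(with_bot), 3))  # roughly 0.5
```

In this extreme case one account moves a perfectly equal subreddit to a Gini of about 0.5, so a bot that writes half of a subreddit’s comments would dominate the statistic.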
Shout-Outs
Largest Number of Subreddits: /u/codex561, who commented in 1,109 different subreddits. Way to go! [3]
Largest Number of Comments: /u/Doctor-Kitten, who commented an astonishing 13,380 times, more than any other non-bot user!
Highest Total Comment Karma: /u/PainMatrix, who commented 1,361 times, achieving a total score (net up and down) of 187,374!
-Jacob
- [1] Original reddit post: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/. Internet archive link: https://archive.org/details/2015_reddit_comments_corpus
- [2] /r/goodyearwelt I’m sorry to have put you in the middle of these two, but it wasn’t my fault, it was the math!
- [3] Honorable mention to /u/lolhaibai at 2,252 subreddits, but who was disqualified from these pointless awards because they deleted their account. And dishonorable mention to /u/VeryAwesome69, who had activity on 1,116 subreddits, each and every comment being the question “Ever had a dream where all your friends were dinosaurs?”. I have not.