Estimating active reddit users

I’m always curious about how much activity subreddits have, and how representative the comments are of the userbase. It’s well known that the majority of people are lurkers, who just view content but don’t vote or comment. Some subset of those people will actually create accounts and vote on stuff, and some subset of those people will comment. We can count total users (lurkers + voters + commenters) via traffic statistics, and we can count commenters just by analyzing comments.

But how do we count voters? That is, people with accounts who vote but don’t comment. That data isn’t available; all we have to go on are the comments and their vote counts. Statistics to the rescue!

Chao’s Estimator

Here I borrow a technique from ecology used to estimate species richness. Say Ecol O. Gist wants to know how many species of plant exist in a swamp. Well they can go through and count how many they find, but the only way to guarantee that is accurate is to do an exhaustive search of every single plant. Which is impossible.

So instead Dr. Gist comes up with a plan. Go to the swamp, grab a sample of plants, note which species they see. Then come back the next day and do it again. Then come back the next day and do it again. Rinse and repeat a bunch of times. If they continue to see mostly new species each day, that means they haven’t come close to getting a complete count. Once they start to see mostly duplicates, it’s a good bet they’ve seen all the species.

The above hand-wavy explanation is the intuition behind Chao’s estimator of population size [1]. This estimator is very general: it does not assume equal “catchability” (i.e. probability of being sampled) across all things being sampled. Which is good, because it’s much easier to find a tree in one’s sample than a particular type of moss. Applied to our case of interest, it’s easier to sample a highly active (~1000 comments) user than a rarely active (~1 comment) one.

This estimator does assume a consistent probability per class, meaning that a tree is just as likely to be sampled one day as the next, or that a user who comments on 10% of days has a 10% chance of commenting on each day independently of the others. This is definitely not true for reddit (weekdays 9-5 are the most active times). It also assumes a fixed population size, which is also not true for reddit (subreddits grow and shrink). Both of these are limitations which the reader should keep in mind.

The relevant formulas that I used are:

(1)   \begin{eqnarray*} \hat{N} = S + \frac{f_1(f_1-1)}{2(f_2 + 1)} \approx S + \frac{f_1^2}{2 f_2} \\ \hat{\sigma}^2 = f_2 \bigg(\frac{f_1}{(f_2 + 1)}\bigg)^2  [ 0.25 \bigg(\frac{f_1}{(f_2 + 1)}\bigg)^2 + \bigg(\frac{f_1}{(f_2 + 1)}\bigg) + 0.5] \\ \approx \frac{f_1^2}{f_2 } [ 0.25 (f_1/f_2)^2 + f_1/f_2+ 0.5] \end{eqnarray*}

Where f_1 is the number of users observed exactly once, f_2 the number observed exactly twice, S is the total number of users observed, \hat{N} is the estimated number of total users (observed + unobserved), and \hat{\sigma}^2 the estimated variance of \hat{N}. I used the bias-corrected formulas (which have f_2+1) for calculations as they are more accurate and account for f_2=0; the formulas without that correction are much more readable and are shown as approximations.
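As a minimal sketch, the bias-corrected formulas above translate directly into a few lines of Python. The function name `chao_estimate` and the example counts are made up for illustration; they are not taken from the analysis.

```python
import math

def chao_estimate(s_obs, f1, f2):
    """Bias-corrected Chao estimate and its estimated variance.

    s_obs: number of distinct users observed (S)
    f1:    users observed in exactly one sampling period
    f2:    users observed in exactly two sampling periods
    """
    ratio = f1 / (f2 + 1)
    n_hat = s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    variance = f2 * ratio**2 * (0.25 * ratio**2 + ratio + 0.5)
    return n_hat, variance

# Hypothetical counts, chosen so the arithmetic is easy to follow.
n_hat, variance = chao_estimate(s_obs=1000, f1=400, f2=199)
print(n_hat, math.sqrt(variance))
```

With these made-up counts the ratio f_1/(f_2+1) is exactly 2, so the estimate is 1000 + 399 = 1399 users and the variance is 199 * 4 * 3.5 = 2786.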


The dataset was the same one I’ve used previously, all reddit comments from January 2015, separated by subreddit. I required 1000 unique commenters for a subreddit to be included, which left 1322 subreddits. A “sampling period” was considered to be one calendar day. So f_1 is the number of users who commented on exactly one day, regardless of how many times they commented on that day.
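To make the "one calendar day per sampling period" definition concrete, here is a small sketch of how S, f_1 and f_2 could be tallied from raw comments. The `(username, day)` record format and the function name are assumptions for illustration, not the format of the actual corpus.

```python
from collections import Counter

def frequency_counts(records):
    """records: iterable of (username, day) pairs, one per comment.

    Returns (S, f1, f2): distinct users, users seen on exactly one day,
    and users seen on exactly two days.
    """
    # Collapse repeat comments: a user counts at most once per day.
    user_days = set(records)
    days_per_user = Counter(user for user, _day in user_days)
    days_hist = Counter(days_per_user.values())
    return len(days_per_user), days_hist[1], days_hist[2]

# Hypothetical toy data: alice comments twice on day 1, dave twice on day 4.
comments = [
    ("alice", 1), ("alice", 1), ("alice", 2),
    ("bob", 3),
    ("carol", 1), ("carol", 2), ("carol", 5),
    ("dave", 4), ("dave", 4),
]
print(frequency_counts(comments))  # (4, 2, 1)
```

Here bob and dave each appear on one day (f_1 = 2) even though dave commented twice, and alice appears on two days (f_2 = 1).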

A natural method to check for accuracy is to include more and more data and check for convergence. A major drawback to this approach is that it assumes fixed population size (ie no growth in the size of the subreddit); balancing these concerns is part of the reason I stuck to one month of data. 31 days is a decent sample size, and most subreddits won’t grow too much during that time. Still, to counteract any time-related trends I shuffled the days being included, so an increase in users would just look like a fluctuation.
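The shuffle-and-accumulate check described above might look roughly like the following. This is a sketch under assumptions: the input is a hypothetical dict mapping each day to the set of users who commented that day, and the seed is arbitrary.

```python
import random
from collections import Counter

def estimates_by_days_included(days_to_users, seed=0):
    """Chao estimate after including 1, 2, ... days, taken in shuffled order.

    days_to_users: dict mapping day -> set of usernames seen that day
    (an assumed input format).
    """
    days = list(days_to_users)
    random.Random(seed).shuffle(days)  # break up any time trend
    days_seen = Counter()              # user -> number of days observed so far
    estimates = []
    for day in days:
        days_seen.update(days_to_users[day])
        hist = Counter(days_seen.values())
        f1, f2 = hist[1], hist[2]
        estimates.append(len(days_seen) + f1 * (f1 - 1) / (2 * (f2 + 1)))
    return estimates

toy = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "d"}}
estimates = estimates_by_days_included(toy)
print(estimates)
```

The intermediate values depend on the shuffle order, but the final entry always uses every day, so it is the same for any seed.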


The results for two subreddits are shown below.


These subreddits are roughly similar in size, which is why I picked them. Notice how in the last 5 days the /r/programming curve is fairly flat, while the /r/cars curve continues to climb.

There needs to be some way to determine if something has “converged”. The metric I used was comparing the average change over the last 5 days to the standard deviation[2]. The motivation here is that of course the estimate will fluctuate due to chance, and the standard deviation gives a measure of that. In the case of /r/programming, the estimate changed by only 255 users, while the standard deviation was 430. Meanwhile at /r/cars, the estimate changed by 1500 users, with a standard deviation of 329.
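One way to sketch this convergence test in code: fit a least-squares line to the last 5 estimates and compare the fitted change to the standard deviation. The exact definition of "change" (slope times the span of the window) and the strict less-than threshold are assumptions here; the post only says the change was derived from the regression slope.

```python
def has_converged(estimates, std_dev, window=5):
    """Return (converged, change) for the last `window` estimates.

    `change` is the fitted slope times the window span, which is an
    assumed reading of "change based on the slope".
    """
    ys = estimates[-window:]
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    change = abs(slope) * (n - 1)
    return change < std_dev, change

# Hypothetical series: one still climbing, one nearly flat.
print(has_converged([100, 110, 120, 130, 140], std_dev=30))
print(has_converged([100, 101, 102, 103, 104], std_dev=30))
```

The climbing series changes by 40 over the window, more than the standard deviation of 30, so it does not count as converged. The flat one changes by 4 and does.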

Of the 1322 subreddits, 446 of them “converged”. A scatterplot of the estimated number of accounts vs directly observed commenters is shown below. The red dots show subreddits which did not converge (estimate using all data shown), the blue triangles show subreddits which did converge. Error bars are not shown because they are smaller than the points being plotted on a log-log scale.


Immediately a trend is apparent: this method only converges for small subreddits. There is also a good linear fit of estimated total vs observed commenters which applies to both converged and nonconverged. The pessimistic view is that this means even the converged estimates are invalid, the optimistic view is that even the nonconverged estimates may be valid. It could also be that the true trend is non-linear, for which the linear value is an approximation, making the convergent estimates valid and the nonconvergent estimates invalid.

A histogram of the estimated/observed ratios is shown below:



So in general we estimate there are about 2x as many potential commenters (i.e. voters) as there are actual commenters. This is lower than I expected based on the 90/9/1 rule, which would put it at around 9x. Then again, the quantity I’m attempting to measure here is a bit odd. It’s people with accounts who could participate, but don’t. Whereas the “9” in that rule refers to people who sparsely participate; that can include commenting as well as voting.

It’s worth noting the outliers, /r/nameaserver and /r/millionairemakers [3]. /r/nameaserver is invitation-only, open to people who buy gold at the right time; they get together and vote to decide on a server name. They have 24 hours to decide. So most people will only comment on one day, leading to an enormous overestimate.

/r/millionairemakers is a subreddit which holds drawings once a month; a winner is picked and people are called to donate. The way you enter the drawing is by commenting on the relevant thread. So basically there are going to be a lot of users who comment exactly once on one day.

Both of these subreddits violate the assumption of Chao’s estimator that each user has a consistent probability of being sampled from day to day, so it’s no surprise the results are odd. The extreme-skepticism view would be to say that this invalidates all the results above. I’m a bit more optimistic; since there are obvious outliers with an obvious explanation for why they are outliers, it may be that the method works well for subreddits which don’t have pathological commenting patterns.

It’s analyses like this that make me want to work at reddit, or at least have access to their whole dataset. In most cases Chao’s estimator will provide an underestimate, so the 2:1 ratio is likely too low. Without tracking users’ pageviews and votes, though, there is no way to know for sure. It would be really amazing if this statistical magic actually provided an accurate answer.


  1. [1]Chao, A. 1987. Estimating the Population Size for Capture-Recapture Data with Unequal Catchability. Biometrics 43: 783-791
  2. [2]Specifically, I performed a linear regression on the last 5 days and calculated the change based on the slope
  3. [3]These also had the lowest Gini coefficients in my last analysis
Posted in reddit, Statistics

The comments of the few outweigh the comments of the many

The Pareto Principle for businesses states that 80% of sales come from 20% of customers. Social media has the same skew; the majority of content comes from a minority of users. I’ve always been curious just how skewed this activity can be. In particular, the skew won’t be the same across different forums. Reddit provides a natural opportunity to measure this skew, since there are an enormous number of subreddits with varying levels of activity. Subreddits provide straightforward topic-association, so we can see which topics tend to be more dominated by fewer individuals.


I used the corpus of reddit comments collected by /u/Stuck_In_the_Matrix[1]. To keep things manageable I only used one month of data, January 2015. This corpus includes 53,851,542 total comments from 2,512,122 unique usernames, and (allegedly) represents every public comment on reddit during that time. For simplicity I’ll use the terms username/user/author interchangeably, although strictly speaking a user may have many usernames. I excluded comments which were deleted, or by any of about a dozen highly active bots (AutoModerator, autowikibot, etc.). Only subreddits with at least 1000 users were included.

Totals: 1,322 subreddits. 2,512,103 users. 42,033,578 comments. Woot!

The metric I’ll be presenting here is the Gini coefficient. It was developed for measuring wealth inequality, and can be applied to any frequency distribution. It takes a value of 0 for a perfectly equal distribution, and 1 for completely unequal (1 person responsible for all the wealth/comments, none by anybody else). The full processed statistics are available in a fusion table here, including some other statistics not discussed in this post.
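For readers who want to compute this themselves, here is a minimal sketch of the Gini coefficient applied to comments-per-user counts. It uses the standard sorted-rank formula; the function name and the toy inputs are illustrative only.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (e.g. comments per user)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Sorted-rank form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([5, 5, 5, 5]))    # everyone comments equally
print(gini([0, 0, 0, 10]))   # one user writes everything
```

Note that with a finite number of users the maximum is (n-1)/n rather than exactly 1, so the second example gives 0.75 for four users. It approaches 1 as the number of users grows.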


The distribution of Gini coefficients across subreddits is shown below. The average value is 0.59, with most subreddits falling between 0.4 and 0.8.


I was a little surprised by this plot, mainly by the spread. An average Gini of 0.59 seems reasonable, though that’s an intense skew. For reference, the Gini coefficient of US income is 0.47. There is a very wide spread, though. Some subreddits are very highly concentrated, and some are much more egalitarian.

We can also look at the most and least egalitarian subreddits. Here are the top and bottom 50 subreddits by Gini index:


The low-Gini category seems to be mostly pictures: /r/gif, /r/AnimalsBeingBros, /r/AnimalsBeingJerks, /r/HumanPorn (for human portraits). /r/millionairemakers is an interesting sub; a random winner is chosen and subscribers are asked to donate $1, with the hope of making one person a millionaire. They haven’t made any millionaires but they’ve made some people a few thousand. Among the high-Gini subs we see sports-related subs ( /r/nfl, /r/BostonBruins ) and some other entertainment subs. /r/RWBY is about an anime web series, and /r/MLPLounge and /r/mylittlepony are both present. Sidenote: This might be the first time I’ve seen /r/Coontown (racist) and /r/GamerGhazi (anti-gamergate / pro-social-justice) so close to each other[2].

Putting these together, it seems like more casual subreddits have the lowest Gini. Nobody is super-passionate about /r/Eyebleach, you just go to look at pretty pictures.  The high-gini subs have topics which people get a bit more passionate about; their favorite show, sports team, video game, etc. There are exceptions; /r/arresteddevelopment is a low-Gini subreddit for instance. A small core of extremely passionate individuals is what makes a high-Gini environment. I’m sure many users on /r/cigars just want a few tips, but I’m equally sure some people are mega-serious about getting the best damn cigar they can.


Since this is a complete dataset there shouldn’t be much in the way of selection biases. There were 3 million deleted comments which represent ~6% of the total, not a huge amount. Also, there is no way to link usernames to people, so interpretation gets a little bit complicated. Without knowing how many alternate accounts exist, or the commenting behavior on them, it’s hard to know how inequality in comments-per-username translates into inequality in comments-per-real-person.

Also, I excluded the most prolific bots, but there are likely some I missed. A prolific bot targeted at a specific subreddit will have very high activity and could cause an artificially high Gini index.


Largest Number of Subreddits: /u/codex561, who commented in 1,109 different subreddits. Way to go! [3]

Largest Number of Comments: /u/Doctor-Kitten, who commented an astonishing 13,380 times, more than any other non-bot user!

Highest Total Comment Karma: /u/PainMatrix, who commented 1,361 times, achieving a total score (net up and down) of 187,374!


  1. [1]Original reddit post: Internet archive link:
  2. [2]/r/goodyearwelt I’m sorry to have put you in the middle of these two, but it wasn’t my fault, it was the math!
  3. [3] Honorable mention to /u/lolhaibai at 2,252 subreddits, but who was disqualified from these pointless awards because they deleted their account. And dishonorable mention to /u/VeryAwesome69, who had activity on 1,116 subreddits, each and every comment being the question “Ever had a dream where all your friends were dinosaurs?”. I have not.
Posted in Uncategorized

More on the Bechdel Test

I gave some theoretical insights on the Bechdel test in a previous post, but silly me, of course there is real data! The Cornell Movie-Dialogs Corpus[1] contains conversations between characters in 617 movies.

Conversations in this corpus are already separated, so it’s easy to tell when two people are talking to each other. Most characters are annotated with a gender. Most, but not all. I inferred gender based on the census’ list of popular boys and girls names[2], this added some more information. All in all there were 9,035 characters: 3,027 male, 1,572 female, and 4,436 unknown. Lots of unknowns unfortunately, which means I wouldn’t trust these numbers too much on an absolute scale.

We do have a natural comparison. The actual Bechdel test requires two women talking to each other about something other than a man. We can easily construct a male version: two men talking to each other about something other than a woman. I’ll be comparing these quantities.

Character Ratios

First a quick pass through to count the number of male/female characters. I took the log2 ratio of male/female characters so that the view would be symmetric. A perfectly balanced cast would be at 0, +1 means twice as many male characters, -1 means twice as many female.
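The log2 ratio is easy to sanity-check with a couple of hypothetical casts. This tiny sketch just evaluates the definition above; the character counts are made up.

```python
import math

def log2_gender_ratio(n_male, n_female):
    """log2 of the male:female character ratio; 0 means a balanced cast."""
    return math.log2(n_male / n_female)

print(log2_gender_ratio(10, 10))  # balanced cast
print(log2_gender_ratio(12, 6))   # twice as many male characters
print(log2_gender_ratio(6, 12))   # twice as many female characters
```

The symmetry is the point: a 2:1 cast and a 1:2 cast sit at the same distance from 0, which a plain ratio (2 vs 0.5) would not show.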


The overall median is a 2:1 ratio of male:female characters, and it’s remarkably consistent across genres. There is a pretty wide variance, which may be due to the incomplete gender-tagging of names in the corpus.


Now the hard part. We need to identify conversations which are between two women only, and about something other than a man. I’m also doing the reverse, identifying conversations between two men which are about something other than a woman, for comparison.

Checking the gender is straightforward (it’s either annotated in the database or it’s not) and I’m only counting convos as passing if both characters are KNOWN to be women (men). So characters with unknown gender are excluded.

Checking the topic is a bit harder. The method I’m using is simple: check for the presence of a male (female) character name (in the same movie) in the conversation, as well as known male (female) pronouns. Obviously this isn’t perfect, but since I’m doing an apples-to-apples comparison between men and women any flaws should balance out. Technically the Bechdel test only requires 1 passing conversation; for robustness in this analysis I required 2 per movie.
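A rough sketch of the female version of this check follows. Everything concrete in it is an assumption for illustration: the pronoun list, the `(gender_a, gender_b, lines)` conversation format, single-word character names, and the function names. The male version would swap the genders and use female names and pronouns.

```python
import re

# Assumed pronoun list; the post does not say which pronouns were used.
MALE_PRONOUNS = {"he", "him", "his", "himself"}

def mentions_any(lines, names, pronouns):
    """True if any line contains one of the names or pronouns as a whole word."""
    targets = {name.lower() for name in names} | pronouns
    for line in lines:
        words = set(re.findall(r"[a-z']+", line.lower()))
        if words & targets:
            return True
    return False

def passes_female_version(conversations, male_names, min_passing=2):
    """conversations: list of (gender_a, gender_b, lines), gender 'f', 'm' or None."""
    passing = sum(
        1 for gender_a, gender_b, lines in conversations
        if gender_a == "f" and gender_b == "f"
        and not mentions_any(lines, male_names, MALE_PRONOUNS)
    )
    return passing >= min_passing

# Hypothetical movie with four conversations.
movie = [
    ("f", "f", ["Did you finish the report?", "Almost."]),
    ("f", "f", ["Is he coming tonight?"]),      # mentions a man
    ("f", "f", ["The ship leaves at dawn."]),
    ("f", None, ["Hello there."]),              # unknown gender: excluded
]
print(passes_female_version(movie, male_names={"Jack"}))
```

Matching on whole words matters: "the" contains "he", so a plain substring search would wrongly fail the third conversation.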


Number of Movies Passing Each Version


Fraction of Movies in Genre Passing Each Version

The top graph shows movies by total count, the bottom shows by fraction. Nearly all movies pass at least 1 version. About 75% of movies (red + blue) pass the male version, while about 40% (blue + purple) pass the female version. Action and adventure movies are the most male-biased (surprise!)[3]

Romance, comedy, and horror come the closest to parity. I’m surprised about the last category; I would’ve thought that horror would be male-dominated. And even animation had very few movies passing; won’t somebody think of the children! There were only 10 movies in this genre though, so it may not be representative.

Looking only at movies which passed each respective test, we can see how many passing conversations existed:


This may be a bit hard to read. Blue is female, red is male, they’re next to each other by genre, and the y-axis is the number of passing conversations per movie (on a log10 scale). For the most part, movies which pass the male Bechdel test pass with a whole lot more conversations than those which pass the female version. The median number of male-passing conversations is about 40; for female it’s only 10.

That’s a 4:1 ratio, twice as much as the 2:1 ratio we saw of characters. Which is what one might expect given the bias for male characters, as the number of possible conversation pairs is ~(number of characters)^2. Or it could be that the male characters are more prominent in the story, and hence occupy more screentime.
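The pair-counting argument can be checked with a few lines. This is only a sketch of the arithmetic, with made-up cast sizes: n characters can form n(n-1)/2 distinct same-gender pairs, so doubling the cast roughly quadruples the pairs.

```python
from math import comb

def same_gender_pairs(n_characters):
    """Number of distinct pairs among n characters of the same gender."""
    return comb(n_characters, 2)

# Hypothetical casts with a 2:1 male:female ratio.
for n_female in (5, 10, 50):
    n_male = 2 * n_female
    ratio = same_gender_pairs(n_male) / same_gender_pairs(n_female)
    print(n_female, n_male, round(ratio, 2))
```

The ratio sits a little above 4 for small casts and approaches 4 as the cast grows, in line with the ~4:1 ratio of passing conversations.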

Other Resources

There is an enormous manually curated list of movies and their passing status. This post also has some excellent visualizations, based on a much larger set of movies. And near and dear to my heart, there’s an analysis of every Star Trek episode on The Mary Sue Blog.


  1. [1]Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (CMCL ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 76-87.
  2. [2]
  3. [3]Neither of the modern Tombraider movies pass (according to, despite starring a woman, because she’s the only one
Posted in Text Mining

Some musings on statistics

A) Beware of The Wrong Summary Statistics

SlateStarCodex had a pretty interesting post entitled “Beware of Summary Statistics“, showing how they can be misleading. This isn’t exactly new; there are famous examples of how just looking at the mean and standard deviation greatly oversimplifies. Distributions can have the exact same mean/stdev. but be very different[1]. The main lesson to take away is to always visualize your data.

If you know the distribution in question, though, there is probably a good summary statistic for it. The go-to in social science is Pearson correlation. SSC gave an example of two variables which appeared to be correlated, but that correlation was highly misleading. Here are two “uncorrelated” variables:


The linear fit shows that X and Y are uncorrelated. The pearson correlation is nearly 0. However that is obviously BS as there is a clear trend, just not linear, which is what pearson correlation captures. With the benefit of viewing the data[2], we can correlate Y vs inverse-sin(Y)  (orange). That correlation is 0.99.  The real relationship is Y= Asin(fX), where A = 1 and f = 1. A mean/standard deviation for this would be meaningless, but amplitude/frequency would describe it perfectly.
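The effect is easy to reproduce with synthetic data. The sketch below is not the code behind the plot: the X range, noise level and seed are arbitrary assumptions, and it correlates Y against sin(X), which may differ from the transformation used for the orange curve.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
xs = [i * 0.01 for i in range(6283)]  # roughly ten periods, 0 to 20*pi
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]  # Y = A sin(fX) + noise, A = f = 1

r_linear = pearson(xs, ys)                         # straight-line relationship
r_sine = pearson([math.sin(x) for x in xs], ys)    # relationship of the right shape
print(round(r_linear, 3), round(r_sine, 3))
```

The linear correlation comes out small in magnitude even though Y is almost entirely determined by X, while the correlation against the correctly shaped transform is close to 1.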

Of course this is a rigged example, and I generated the data from a sine wave. In a real-world example, one sometimes knows (has some idea) what shape the distribution will be. If one doesn’t, visualize it and figure it out.

B) Exact Wording Matters

The most famous example I know of is an old study by the Gates Foundation showing that the best schools are small schools. So obviously we need to look at small schools and see why they’re so great, right? Well, no, because the worst schools are also small schools. Small school -> small sample size -> high variance, meaning the outliers are always going to be found in smaller sample sizes:


Source: The Promise and Pitfalls of Using Imprecise School Accountability Measures
Thomas J. Kane and Douglas O. Staiger.[3]

One of the earliest papers on cognitive biases looked at this[4]. They asked people if large hospitals or small hospitals are more likely to have more days where >60% of babies born on that day were male. Most people said the same, because the odds of being born male are the same for any particular baby in either case. But pay closer attention to that wording; it wasn’t about the overall average, it was about the variance. Simpler example: If you flip two quarters at a time, occasionally they’ll both come out heads. If you flip 10 quarters at a time, very rarely will they all be heads.
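The hospital question can be answered exactly with the binomial distribution. In this sketch the 15 and 45 births per day are illustrative values for a small and a large hospital, and a 50% chance of a boy is assumed.

```python
from math import comb

def prob_more_than_60pct_boys(births_per_day, p_boy=0.5):
    """Exact binomial probability that strictly more than 60% of births are boys."""
    return sum(
        comb(births_per_day, k) * p_boy**k * (1 - p_boy) ** (births_per_day - k)
        for k in range(births_per_day + 1)
        if 10 * k > 6 * births_per_day  # integer test avoids float rounding at exactly 60%
    )

small = prob_more_than_60pct_boys(15)  # illustrative small hospital
large = prob_more_than_60pct_boys(45)  # illustrative large hospital
print(round(small, 3), round(large, 3))

# The quarter-flipping version of the same idea.
print(0.5**2, 0.5**10)
```

The small hospital crosses the 60% line on about 15% of days, the large one noticeably less often, for the same reason that two quarters both land heads a quarter of the time while ten quarters do so about once in a thousand tries.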

C) Confounders and Conditional (In)dependence

I love Simpson’s Paradox. Trends which exist in aggregated data can reverse direction when data is broken into subgroups. In the most general case, if subgroups exist, a trend which applies to the aggregate doesn’t have to exist in subgroups, and if it does, doesn’t have to be in the same direction. And vice versa going the other direction, from subgroup to overall.


In the above chart, Y has an overall linear trend against X. But once it’s known whether the point is in S1 or S2, the dependence goes away. So Y is conditionally independent of X. Interpretation will depend on the problem situation. If the difference between S1 and S2 is something we care about, it’s interesting and we publish a paper. Champagne for everybody! If not, it’s a confounder (boo! hiss!).
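A small simulation shows the same pattern. This is a sketch with made-up parameters (group centers, sample sizes, seed), not the data behind the chart: within each group X and Y are drawn independently, and the only thing linking them is which group a point belongs to.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)

def make_group(center, n=500):
    # X and Y are independent draws around a shared group center.
    return [(rng.gauss(center, 1.0), rng.gauss(center, 1.0)) for _ in range(n)]

def corr(points):
    xs, ys = zip(*points)
    return pearson(xs, ys)

s1 = make_group(0.0)  # hypothetical subgroup S1
s2 = make_group(5.0)  # hypothetical subgroup S2

r_all, r_s1, r_s2 = corr(s1 + s2), corr(s1), corr(s2)
print(round(r_all, 2), round(r_s1, 2), round(r_s2, 2))
```

Pooled together the points show a strong correlation, but inside either group the correlation is close to zero, which is exactly the conditional independence described above.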

The easiest way to deal with confounders is to analyze groups separately. Say you’re interested in discovering whether people who walk fast spend more on shoes. Well, age affects walking speed, so to remove that confounder, one could stratify people into different age groups. Confounder removed! It’s a good idea, but it has two serious drawbacks:

1. Each group has a smaller sample size, which increases the variance.

2. Testing multiple groups means testing multiple hypotheses.

These errors compound each other. We’ve got several smaller sample sizes meaning the variance is larger, so the odds of getting at least one false positive gets much larger (see section B)[5]. The social science studies I read never correct for multiple hypotheses, gee I wonder why :-).
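To put a number on the multiple-hypotheses problem, here is a short sketch. It assumes independent tests at a 5% significance level, and it also shows the Bonferroni threshold, which is one common correction and is not something the post itself uses.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests

for n_groups in (1, 5, 10, 20):
    print(n_groups,
          round(familywise_error(0.05, n_groups), 3),
          bonferroni_threshold(0.05, n_groups))
```

With 10 subgroups the chance of at least one spurious "significant" result is already about 40%, which is why stratifying without correcting makes false positives so easy to find.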

Closing Thought

While finishing this post I came across an article about a deliberate scientific “fraud”. The authors did the experiment they said, didn’t make up any data; the only thing which makes this fraud different from so many others is that the authors are publicly saying the result is bullshit. I almost typed “the authors *knew* the result is bullshit” except I’m sure most other snake-oil salesmen know that too. Life is complicated, so don’t trust anybody selling easy answers.



  1. [1]e.g. Anscombe’s Quartet.
  2. [2]and that I generated it
  3. [3]  Journal of Economic Perspectives—Volume 16, Number 4—Fall 2002—Pages 91–114.  Figure 2.
  4. [4]Judgment under Uncertainty: Heuristics and Biases. Amos Tversky; Daniel Kahneman
    Science, New Series, Vol. 185, No. 4157. (Sep. 27, 1974), pp. 1124-1131. 
  5. [5]SSC calls this the “Elderly Hispanic Woman Effect”
Posted in Statistics

Subreddit Map

Reddit describes itself as the “front page of the internet”, and given how many users it has, that’s not too far off. It’s divided into subreddits, which can have either broad or narrow topics. These subreddits are (mostly) user-created, with the admins only occasionally stepping in to remove them. Thus, subreddits represent an “organic” set of topics on social media.

There have been a few subreddit maps created before, like Vizit [1], which was based on cross-posts[2]. Here I’m interested in measuring overlap of users; that is, how many users are in common between different subreddits. (Correction: I originally thought redditviz[3] was based on crossposts, but it’s not, it’s based on users, so check that out for a larger version of the same idea). This presented some practical difficulties because scraping comments is a lot more demanding than scraping posts, so I started with comments for 2,000 subreddits. After removing low-weight edges to remove noise, and removing isolated subreddits, I ended up with about 900.
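The user-overlap idea can be sketched as follows. The details are assumptions for illustration: the edge weight here is the raw count of shared commenters and the cutoff is arbitrary, whereas the post does not say how edges were weighted or filtered.

```python
from itertools import combinations

def overlap_edges(subreddit_users, min_weight=2):
    """Build weighted edges between subreddits that share commenters.

    subreddit_users: dict mapping subreddit name -> set of usernames.
    Returns {(sub_a, sub_b): shared_user_count}, dropping low-weight edges.
    """
    edges = {}
    for sub_a, sub_b in combinations(sorted(subreddit_users), 2):
        shared = len(subreddit_users[sub_a] & subreddit_users[sub_b])
        if shared >= min_weight:
            edges[(sub_a, sub_b)] = shared
    return edges

# Hypothetical commenters for four subreddits.
users = {
    "cars": {"ann", "bob", "cat"},
    "programming": {"bob", "cat", "dan"},
    "python": {"cat", "dan", "eve"},
    "knitting": {"fay"},
}
edges = overlap_edges(users)
connected = {sub for pair in edges for sub in pair}
isolated = set(users) - connected
print(edges, isolated)
```

Subreddits that end up with no surviving edges, like the made-up "knitting" here, are the isolated ones that would be dropped from the map.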

The full map can be viewed here

The networks (pre- and post- filtering) are available here.


  1. [1]Vizit.
  2. [2]Where the same link is posted to multiple subreddits
  3. [3]Redditviz.
Posted in reddit, Social Media, Text Mining