Estimating active reddit users

I’m always curious about how much activity subreddits have, and how representative the comments are of the userbase. It’s well known that the majority of people are lurkers, who just view content but don’t vote or comment. Some subset of those people will actually create accounts and vote on stuff, and some subset of those people will comment. We can count total users (lurkers + voters + commenters) via traffic statistics, and we can count commenters just by analyzing comments.

But how do we count voters? That is, people with accounts who vote but don’t comment. That data isn’t available; all we have to go on are the comments and their vote counts. Statistics to the rescue!

Chao’s Estimator

Here I borrow a technique from ecology used to estimate species richness. Say Ecol O. Gist wants to know how many species of plant exist in a swamp. Well, they can go through and count how many they find, but the only way to guarantee that the count is accurate is to do an exhaustive search of every single plant, which is impossible.

So instead Dr. Gist comes up with a plan. Go to the swamp, grab a sample of plants, note which species they see. Then come back the next day and do it again. Then come back the next day and do it again. Rinse and repeat a bunch of times. If they continue to see mostly new species each day, that means they haven’t come close to getting a complete count. Once they start to see mostly duplicates, it’s a good bet they’ve seen all the species.

The above hand-wavy explanation is the intuition behind Chao’s estimator of population size [1]. This estimator is very general: it does not assume equal “catchability” (i.e. probability of being sampled) across all things being sampled. That is good, because it’s much easier to find a tree in one’s sample than a particular type of moss. Applied to our case of interest, it’s easier to sample a highly active (~1000 comments) user than a rarely active (~1 comment) one.

This estimator does assume a consistent probability per class, meaning that a tree is just as likely to be sampled one day as the next, or that a user who comments on 10% of days has a 10% chance of commenting on each day independently of the others. This is definitely not true for reddit (weekdays 9-5 are the most active times). It also assumes a fixed population size, which is also not true for reddit (subreddits grow and shrink). Both of these are limitations which the reader should keep in mind.

The relevant formulas that I used are:

\begin{align*}
\hat{N} &= S + \frac{f_1(f_1-1)}{2(f_2+1)} \approx S + \frac{f_1^2}{2 f_2} \\
\hat{\sigma}^2 &= f_2 \left(\frac{f_1}{f_2+1}\right)^2 \left[ 0.25 \left(\frac{f_1}{f_2+1}\right)^2 + \frac{f_1}{f_2+1} + 0.5 \right] \\
&\approx \frac{f_1^2}{f_2} \left[ 0.25 \left(\frac{f_1}{f_2}\right)^2 + \frac{f_1}{f_2} + 0.5 \right]
\end{align*}

Where f_1 is the number of users observed exactly once, f_2 the number observed exactly twice, S is the total number of users observed, \hat{N} is the estimated number of total users (observed + unobserved), and \hat{\sigma}^2 the estimated variance of \hat{N}. I used the bias-corrected formulas (which have f_2+1) for calculations, as they are more accurate and account for f_2=0; the formulas without that correction are much more readable and are shown as approximations.
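As a minimal sketch, the bias-corrected formulas can be written as a small Python function. The function name, argument names, and the example counts are my own choices for illustration and are not taken from the original analysis.

```python
def chao_estimate(s_obs, f1, f2):
    """Bias-corrected Chao estimate of population size and its variance.

    s_obs: number of distinct users observed (S)
    f1:    number of users observed in exactly one sampling period
    f2:    number of users observed in exactly two sampling periods
    """
    n_hat = s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    r = f1 / (f2 + 1)
    variance = f2 * r**2 * (0.25 * r**2 + r + 0.5)
    return n_hat, variance


# Made-up counts, purely for illustration.
n_hat, variance = chao_estimate(s_obs=1000, f1=400, f2=99)
print(n_hat, variance ** 0.5)  # estimate and its standard deviation
```

The square root of the variance is the standard deviation used when judging how much the estimate is expected to fluctuate by chance.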

Methods

The dataset was the same one I’ve used previously: all reddit comments from January 2015, separated by subreddit. I required 1000 unique commenters for a subreddit to be included, which left 1322 subreddits. A “sampling period” was considered to be one calendar day. So f_1 is the number of users who commented on exactly one day, regardless of how many times they commented on that day.
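To make the day-level counting concrete, here is a rough sketch of how S, f_1 and f_2 could be tallied from comment records. It assumes each comment has already been reduced to a (user, day) pair; the function name and the toy data are hypothetical.

```python
from collections import Counter


def frequency_counts(comments):
    """Return (S, f1, f2) from an iterable of (user, day) pairs."""
    # De-duplicate so several comments on the same day count as one sighting.
    user_days = set(comments)
    days_per_user = Counter(user for user, _day in user_days)
    # freq[k] is the number of users seen on exactly k distinct days.
    freq = Counter(days_per_user.values())
    return len(days_per_user), freq[1], freq[2]


# Toy data: alice comments twice on day 1 but that is a single sighting.
comments = [
    ("alice", 1), ("alice", 1), ("alice", 2),
    ("bob", 3),
    ("carol", 1), ("carol", 5), ("carol", 9),
    ("dave", 7), ("dave", 7),
]
s_obs, f1, f2 = frequency_counts(comments)
print(s_obs, f1, f2)
```

Here bob and dave are each seen on one day, alice on two, and carol on three, so only bob and dave contribute to f_1.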

A natural method to check for accuracy is to include more and more data and check for convergence. A major drawback to this approach is that it assumes a fixed population size (i.e. no growth in the size of the subreddit); balancing these concerns is part of the reason I stuck to one month of data. 31 days is a decent sample size, and most subreddits won’t grow too much during that time. Still, to counteract any time-related trends I shuffled the order in which days were included, so an increase in users would just look like a fluctuation.
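The shuffle-and-accumulate procedure might look something like the sketch below. It assumes the data has been arranged as a mapping from day to the set of users who commented that day; the function names, the fixed seed, and the toy data are all illustrative assumptions rather than the code actually used.

```python
import random
from collections import Counter


def chao_n_hat(days_per_user):
    """Bias-corrected Chao estimate from a user -> days-seen mapping."""
    freq = Counter(days_per_user.values())
    s_obs, f1, f2 = len(days_per_user), freq[1], freq[2]
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def estimate_curve(users_by_day, seed=0):
    """Estimate after each additional day, with days added in shuffled order."""
    days = list(users_by_day)
    random.Random(seed).shuffle(days)  # break up any time trend
    days_per_user = Counter()
    curve = []
    for day in days:
        # Each user in the set gets one more sighting.
        days_per_user.update(users_by_day[day])
        curve.append(chao_n_hat(days_per_user))
    return curve


# Toy data: three days, four users.
users_by_day = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "d"}}
curve = estimate_curve(users_by_day)
print(curve)
```

The last point of the curve uses all days, so it does not depend on the shuffle; only the path taken to get there does.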

Results

The results for two subreddits are shown below.

[Figure: estimated number of users vs. number of days included, for /r/programming and /r/cars]

These subreddits are roughly similar in size, which is why I picked them. Notice how in the last 5 days the /r/programming curve is fairly flat, while the /r/cars curve continues to climb.

There needs to be some way to determine if something has “converged”. The metric I used was comparing the average change over the last 5 days to the standard deviation [2]. The motivation here is that of course the estimate will fluctuate due to chance, and the standard deviation gives a measure of that. In the case of /r/programming, the estimate changed by only 255 users, while the standard deviation was 430. Meanwhile at /r/cars, the estimate changed by 1500 users, with a standard deviation of 329.
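A sketch of one way to implement this check follows. Two details are assumptions on my part, since the text does not pin them down: the change is taken as the regression slope multiplied by the number of day-to-day steps in the window, and a curve counts as converged when that change is smaller in magnitude than the standard deviation. The function names and toy numbers are hypothetical.

```python
def last_days_change(estimates, window=5):
    """Change across the last `window` estimates, from a least-squares slope."""
    ys = estimates[-window:]
    xs = range(len(ys))
    x_mean = sum(xs) / len(ys)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    # Assumption: total change = slope * number of steps in the window.
    return slope * (len(ys) - 1)


def has_converged(estimates, std_dev, window=5):
    # Assumption: converged when the change is within one standard deviation.
    return abs(last_days_change(estimates, window)) < std_dev


# Toy curve that climbs by 10 users per day over the last 5 days.
estimates = [100, 110, 120, 130, 140]
change = last_days_change(estimates)
print(change, has_converged(estimates, 50), has_converged(estimates, 30))
```

Under the same threshold assumption, the /r/programming numbers (255 vs. 430) would count as converged and the /r/cars numbers (1500 vs. 329) would not.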

Of the 1322 subreddits, 446 of them “converged”. A scatterplot of the estimated number of accounts vs directly observed commenters is shown below. The red dots show subreddits which did not converge (estimate using all data shown), and the blue triangles show subreddits which did converge. Error bars are not shown because they are smaller than the points being plotted on a log-log scale.

[Figure: scatterplot of estimated number of accounts vs. directly observed commenters, log-log scale]

Immediately a trend is apparent: this method only converges for small subreddits. There is also a good linear fit of estimated total vs observed commenters which applies to both converged and nonconverged subreddits. The pessimistic view is that this means even the converged estimates are invalid; the optimistic view is that even the nonconverged estimates may be valid. It could also be that the true trend is non-linear, for which the linear value is an approximation, making the convergent estimates valid and the nonconvergent estimates invalid.

A histogram of the estimated/observed ratios is shown below:

[Figure: histogram of estimated/observed ratios]

So in general we estimate there are 2x as many potential commenters (i.e. voters) as there are actual commenters. This is lower than I expected based on the 90/9/1 rule, which would put it at around 9x. Then again, the quantity I’m attempting to measure here is a bit odd. It’s people with accounts who could participate, but don’t. Whereas the “9” in that rule refers to people who sparsely participate; that can include commenting as well as voting.

It’s worth noting the outliers, /r/nameaserver and /r/millionairemakers [3]. /r/nameaserver is open by invitation to people who buy gold at the right time; they get together and vote to decide on a server name. They have 24 hours to decide. So most people will only comment on one day, leading to an enormous overestimate.

/r/millionairemakers is a subreddit which holds drawings once a month; a winner is picked and people are called to donate. The way you enter the drawing is by commenting on the relevant thread. So basically there are going to be a lot of users who comment exactly once on one day.
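To see why one-day commenting patterns inflate the estimate, compare two hypothetical subreddits with the same number of observed commenters. The counts below are invented for illustration and are not the real figures for either outlier.

```python
def chao_n_hat(s_obs, f1, f2):
    """Bias-corrected Chao estimate of total population size."""
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical subreddit with a healthy share of repeat commenters.
typical = chao_n_hat(s_obs=2000, f1=800, f2=399)

# Hypothetical one-day-event subreddit: same S, almost nobody seen twice.
one_shot = chao_n_hat(s_obs=2000, f1=1950, f2=49)

# Ratio of estimated total to observed commenters in each case.
print(typical / 2000, one_shot / 2000)
```

Because the correction term grows roughly as f_1 squared divided by f_2, a large pile of single-day commenters with very few two-day commenters pushes the estimate far above the observed count.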

Both of these subreddits violate the assumption in Chao’s estimator that each user has a consistent probability of commenting from one day to the next, so it’s no surprise the results are odd. The extreme-skepticism view would be to say that this invalidates all the results above. I’m a bit more optimistic; since there are obvious outliers with an obvious explanation for why they are outliers, it may be that the method works well for subreddits which don’t have pathological commenting patterns.

It’s analyses like this that make me want to work at reddit, or at least have access to their whole dataset. In most cases Chao’s estimator will provide an underestimate, so the 2:1 ratio is likely too low. Without tracking users’ pageviews and votes, though, there is no way to know for sure. It would be really amazing if this statistical magic actually provided an accurate answer.

-Jacob

[1] Chao, A. 1987. Estimating the Population Size for Capture-Recapture Data with Unequal Catchability. Biometrics 43: 783-791.
[2] Specifically, I performed a linear regression on the last 5 days and calculated the change based on the slope.
[3] These also had the lowest Gini coefficients in my last analysis.