The comments of the few outweigh the comments of the many

The Pareto Principle for businesses states that 80% of sales come from 20% of customers. Social media has the same skew: the majority of content comes from a minority of users. I’ve always been curious just how skewed this activity can be. In particular, the skew won’t be the same across different forums. Reddit provides a natural opportunity to measure it: there are an enormous number of subreddits with varying levels of activity, and subreddits provide straightforward topic association, so we can see which topics tend to be dominated by fewer individuals.

Dataset

I used the corpus of reddit comments collected by /u/Stuck_In_the_Matrix[1]. To keep things manageable I only used one month of data, January 2015. This corpus includes 53,851,542 total comments from 2,512,122 unique usernames, and (allegedly) represents every public comment on reddit during that time. For simplicity I’ll use the terms username/user/author interchangeably, although strictly speaking a user may have many usernames. I excluded comments which were deleted, or which were made by any of about a dozen highly active bots (AutoModerator, autowikibot, etc.). Only subreddits with at least 1,000 users were included.

Totals: 1,322 subreddits. 2,512,103 users. 42,033,578 comments. Woot!

The metric I’ll be presenting here is the Gini coefficient. It was developed for measuring wealth inequality, but it can be applied to any frequency distribution. It takes a value of 0 for a perfectly equal distribution, and 1 for a completely unequal one (1 person responsible for all the wealth/comments, none from anybody else). The full processed statistics are available in a fusion table here, including some other statistics not discussed in this post.
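For concreteness, here’s one standard way to compute it from per-user comment counts (a minimal sketch, not the code I used for this analysis):

    import numpy as np

    def gini(counts):
        """Gini coefficient: 0 = perfectly equal, -> 1 as one user dominates."""
        x = np.sort(np.asarray(counts, dtype=float))
        n = len(x)
        ranks = np.arange(1, n + 1)              # 1..n over the sorted values
        return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

    print(gini([5, 5, 5, 5, 5]))      # 0.0: everyone comments equally
    print(gini([0, 0, 0, 0, 100]))    # 0.8: one user makes every comment (-> 1 as n grows)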

Results

The distribution of Gini coefficients across subreddits is shown below. The average value is 0.59, with most subreddits falling between 0.4 and 0.8.

[Figure: histogram of Gini coefficients across subreddits]

I was a little surprised by this plot, mainly by the spread. An average Gini of 0.59 seems reasonable, though it’s an intense skew; for reference, the Gini coefficient of US income is 0.47. And the spread is very wide: some subreddits are highly concentrated, and some are much more egalitarian.

We can also look at the most and least egalitarian subreddits. Here are the top and bottom 50 subreddits by Gini index:

[Figure: table of the top and bottom 50 subreddits by Gini index]

The low-Gini category seems to be mostly picture subreddits: /r/gif, /r/AnimalsBeingBros, /r/AnimalsBeingJerks, /r/HumanPorn (for human portraits). /r/millionairemakers is an interesting sub; a random winner is chosen and subscribers are asked to donate $1, with the hope of making one person a millionaire. They haven’t made any millionaires, but they have made some people a few thousand. Among the high-Gini subs we see sports-related subs ( /r/nfl, /r/BostonBruins ) and some other entertainment subs: /r/RWBY is an anime web series, and /r/MLPLounge and /r/mylittlepony are both present. Sidenote: this might be the first time I’ve seen /r/Coontown (racist) and /r/GamerGhazi (anti-gamergate / pro-social-justice) so close to each other[2].

Putting these together, it seems like more casual subreddits have the lowest Gini. Nobody is super-passionate about /r/Eyebleach; you just go there to look at pretty pictures. The high-Gini subs have topics which people get a bit more passionate about: their favorite show, sports team, video game, etc. There are exceptions; /r/arresteddevelopment is a low-Gini subreddit, for instance. A small core of extremely passionate individuals is what makes a high-Gini environment. I’m sure many users on /r/cigars just want a few tips, but I’m equally sure some people are mega-serious about getting the best damn cigar they can.

Caveats

Since this is a complete dataset, there shouldn’t be much in the way of selection bias. There were 3 million deleted comments, which represent ~6% of the total; not a huge amount. Also, there is no way to link usernames to people, so interpretation gets a little complicated. Without knowing how many alternate accounts exist, or the commenting behavior on them, it’s hard to know how inequality in comments-per-username translates into inequality in comments-per-real-person.

Also, I excluded the most prolific bots, but there are likely some I missed. A prolific bot targeted at a specific subreddit will have very high activity and could cause an artificially high Gini index.

Shout-Outs

Most Subreddits: /u/codex561, who commented in 1,109 different subreddits. Way to go! [3]

Most Comments: /u/Doctor-Kitten, who commented an astonishing 13,380 times, more than any other non-bot user!

Highest Total Comment Karma: /u/PainMatrix, who commented 1,361 times, achieving a total score (net up and down) of 187,374!

-Jacob

  [1] Original reddit post: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/. Internet Archive link: https://archive.org/details/2015_reddit_comments_corpus
  [2] /r/goodyearwelt, I’m sorry to have put you in the middle of these two, but it wasn’t my fault, it was the math!
  [3] Honorable mention to /u/lolhaibai at 2,252 subreddits, who was disqualified from these pointless awards for deleting their account. And dishonorable mention to /u/VeryAwesome69, who had activity on 1,116 subreddits, each and every comment being the question “Ever had a dream where all your friends were dinosaurs?”. I have not.

Some musings on statistics

A) Beware of The Wrong Summary Statistics

SlateStarCodex had a pretty interesting post entitled “Beware of Summary Statistics”, showing how they can be misleading. This isn’t exactly new; there are famous examples of how just looking at the mean and standard deviation greatly oversimplifies, since distributions can have the exact same mean/stdev but be very different[1]. The main lesson to take away is to always visualize your data.

If you know the distribution in question, though, there is probably a good summary statistic for it. The go-to in social science is the Pearson correlation. SSC gave an example of two variables which appeared to be correlated, but where the correlation was highly misleading. Here are two “uncorrelated” variables:

[Figure: the two “uncorrelated” variables, with a linear fit and a sine fit (orange)]

The linear fit shows that X and Y are uncorrelated; the Pearson correlation is nearly 0. That is obviously BS, though: there is a clear trend, it just isn’t linear, and linear relationships are all that Pearson correlation captures. With the benefit of viewing the data[2], we can correlate Y against sin(X) (orange). That correlation is 0.99. The real relationship is Y = A·sin(fX), where A = 1 and f = 1. A mean/standard deviation for this would be meaningless, but amplitude/frequency would describe it perfectly.

Of course this is a rigged example; I generated the data from a sine wave. In a real-world problem, one often knows (or has some idea) what shape the distribution will be. If one doesn’t, visualize it and figure it out.
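For the curious, here’s a small simulation in the same spirit (the range and noise level are my guesses, not the values behind the actual plot):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    x = np.linspace(0, 20 * np.pi, 1000)          # ten full periods
    y = np.sin(x) + rng.normal(0, 0.1, x.size)    # Y = A*sin(f*X) + noise, A = f = 1

    print(pearsonr(x, y)[0])          # ~ -0.08: essentially no *linear* relationship
    print(pearsonr(np.sin(x), y)[0])  # ~ 0.99: near-perfect after the right transform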

B) Exact Wording Matters

The most famous example I know of is an old study by the Gates Foundation showing that the best schools are small schools. So obviously we need to look at small schools and see why they’re so great, right? Well, no, because the worst schools are also small schools. Small school -> small sample size -> high variance, meaning the outliers will always be found among the smaller samples:

[Figure: grade change vs. school size]

Source: Thomas J. Kane and Douglas O. Staiger, “The Promise and Pitfalls of Using Imprecise School Accountability Measures”[3]
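A quick simulation makes the point; the score distribution below is invented, but any choice gives the same qualitative result:

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = rng.integers(20, 2000, size=1000)     # 1,000 schools of varying size
    # every school draws students from the same distribution: no real quality differences
    means = np.array([rng.normal(70, 10, s).mean() for s in sizes])

    order = np.argsort(means)
    print("sizes of the 10 worst schools:", sizes[order[:10]])
    print("sizes of the 10 best schools: ", sizes[order[-10:]])
    # both extremes are dominated by small schools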

One of the earliest papers on cognitive biases looked at this[4]. They asked people whether large hospitals or small hospitals are more likely to have more days where >60% of babies born that day were male. Most people said they’d be the same, because the odds of being born male are the same for any particular baby in either case. But pay closer attention to the wording; it wasn’t about the overall average, it was about the variance. A simpler example: if you flip two quarters at a time, occasionally they’ll both come out heads. If you flip 10 quarters at a time, very rarely will they all be heads.
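Here’s a quick simulation of the hospital question. The 15 vs. 45 births per day below match the numbers from the original problem, if memory serves; treat them as illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    days = 100_000
    for births_per_day in (15, 45):                  # small vs. large hospital
        boys = rng.binomial(births_per_day, 0.5, size=days)
        frac = np.mean(boys / births_per_day > 0.60)
        print(f"{births_per_day} births/day: >60% boys on {frac:.1%} of days")
    # ~15% of days for the small hospital vs. ~7% for the large one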

C) Confounders and Conditional (In)dependence

I love Simpson’s Paradox. Trends which exist in aggregated data can reverse direction when the data is broken into subgroups. In the most general case, if subgroups exist, a trend which applies to the aggregate doesn’t have to exist in the subgroups, and if it does, it doesn’t have to be in the same direction; the same holds going the other way, from subgroup to overall.

[Figure: Y vs. X, with points colored by subgroup S1 or S2]

In the above chart, Y has an overall linear trend against X. But once it’s known whether a point is in S1 or S2, the dependence goes away, so Y is conditionally independent of X given the subgroup. Interpretation will depend on the problem at hand. If the difference between S1 and S2 is something we care about, it’s interesting and we publish a paper. Champagne for everybody! If not, it’s a confounder (boo! hiss!).
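Here’s a minimal sketch of data with exactly this structure (all parameters made up):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    group = rng.integers(0, 2, 500)           # subgroup label: S1 or S2
    x = rng.normal(0, 1, 500) + 3 * group     # the subgroup shifts X...
    y = rng.normal(0, 1, 500) + 3 * group     # ...and Y, but X never drives Y

    print(f"overall: r = {pearsonr(x, y)[0]:.2f}")       # ~0.7, a strong "trend"
    for g in (0, 1):
        r = pearsonr(x[group == g], y[group == g])[0]
        print(f"within S{g + 1}: r = {r:.2f}")           # ~0 in each subgroup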

The easiest way to deal with confounders is to analyze groups separately. Say you’re interested in discovering whether people who walk fast spend more on shoes. Age affects walking speed, so to remove that confounder, one could stratify subjects into different age groups. Confounder removed! It’s a good idea, but it has two serious drawbacks:

1. Each group has a smaller sample size, which increases the variance.

2. Testing multiple groups means testing multiple hypotheses.

These errors compound each other. We’ve got several smaller sample sizes, meaning the variance is larger, so the odds of getting at least one false positive get much larger (see section B)[5]. The social science studies I read never correct for multiple hypotheses; gee, I wonder why :-).
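With ten strata each tested at p < 0.05 and no real effect anywhere, the chance of at least one false positive is 1 - 0.95^10 ≈ 40%. A quick simulation confirms it (the group count and sample sizes are arbitrary):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    trials, strata, n = 2000, 10, 20
    hits = 0
    for _ in range(trials):
        # ten strata, all pure noise: no real effect in any of them
        ps = [ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
              for _ in range(strata)]
        hits += min(ps) < 0.05
    print(f"at least one 'significant' stratum: {hits / trials:.0%}")   # ~40%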

Closing Thought

While finishing this post I came across an article about a deliberate scientific “fraud”. The authors did the experiment they said they did and didn’t make up any data; the only thing which makes this fraud different from so many others is that the authors are publicly saying the result is bullshit. I almost typed “the authors *knew* the result is bullshit”, except I’m sure most other snake-oil salesmen know that too. Life is complicated, so don’t trust anybody selling easy answers.

-Jacob

 

  [1] e.g. Anscombe’s Quartet. http://en.wikipedia.org/wiki/Anscombe%27s_quartet
  [2] ...and that I generated it
  [3] Journal of Economic Perspectives, Volume 16, Number 4, Fall 2002, pages 91-114. Figure 2. http://pubs.aeaweb.org/doi/pdfplus/10.1257/089533002320950993
  [4] Judgment under Uncertainty: Heuristics and Biases. Amos Tversky; Daniel Kahneman. Science, New Series, Vol. 185, No. 4157 (Sep. 27, 1974), pp. 1124-1131. http://psiexp.ss.uci.edu/research/teaching/Tversky_Kahneman_1974.pdf
  [5] SSC calls this the “Elderly Hispanic Woman Effect”

Subreddit Map

Reddit describes itself as the “front page of the internet”, and given how many users it has, that’s not too far off. It’s divided into subreddits, which can have either broad or narrow topics. These subreddits are (mostly) user-created, with the admins only occasionally stepping in to remove them. Thus, subreddits represent an “organic” set of topics on social media.

There have been a few subreddit maps created before, like Vizit[1], which was based on cross-posts[2]. Here I’m interested in measuring the overlap of users; that is, how many users different subreddits have in common. (Correction: I originally thought redditviz[3] was based on crossposts, but it’s not, it’s based on users, so check that out for a larger version of the same idea.) This presented some practical difficulties because scraping comments is a lot more demanding than scraping posts, so I started with comments for 2,000 subreddits. After removing low-weight edges to cut noise, and removing isolated subreddits, I ended up with about 900.
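One simple way to measure that kind of overlap is Jaccard similarity between each subreddit’s set of commenters; here’s a toy sketch (the subreddit names, users, and edge-weight threshold are all made up):

    from itertools import combinations

    # toy input: subreddit -> set of usernames who commented there
    commenters = {
        "askscience": {"alice", "bob", "carol"},
        "science":    {"bob", "carol", "dave"},
        "aww":        {"eve", "mallory"},
    }

    edges = []
    for a, b in combinations(commenters, 2):
        # Jaccard similarity: shared users / total distinct users
        overlap = len(commenters[a] & commenters[b]) / len(commenters[a] | commenters[b])
        if overlap > 0.1:                 # drop low-weight edges as noise
            edges.append((a, b, overlap))

    print(edges)   # [('askscience', 'science', 0.5)]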

The full map can be viewed here

The networks (pre- and post- filtering) are available here.


  [1] Vizit. http://redditstuff.github.io/sna/vizit/
  [2] Where the same link is posted to multiple subreddits
  [3] Redditviz. http://arxiv.org/abs/1312.3387 http://rhiever.github.io/redditviz/

Exaggeration of Science

Communicating scientific results to the public is difficult, even with the best intentions. There are all kinds of subtleties in any study which don’t make it into media coverage. Furthermore, caveats about interpreting results get lost along the way. A recent study, “The association between exaggeration in health related science news and academic press releases: retrospective observational study”[1], looked at the correlation between exaggerated claims in press releases and subsequent media coverage. As part of that study, the authors examined the media coverage of about 500 health-related articles, as well as the press releases put out by the universities themselves.

It occurred to me that this dataset has another potential use: one can look at the press releases alone. This removes the element of the media, and focuses on how scientific institutions themselves are (mis)representing their work. That’s what I did here. Spoiler alert: the problem is systemic and I didn’t see any specific villains.

And lest I be accused of exaggeration myself, I should point out some major limitations. First and foremost, I’m relying on the coding that was done by the paper above. Second, my author-based results are based on web scraping, and there are likely at least a few errors (a small number of mistakes won’t affect the overall statistics, but it does mean one should double-check before picking on a specific author). And lastly, all that I’m measuring here is the correlation between universities/authors and exaggerated press releases. As Goldacre pointed out[2], press releases don’t have listed authors, so we can’t know who exactly is responsible for writing them; we certainly can’t know if misleading statements were intentional or unintentional.


  [1] Sumner Petroc, Vivian-Griffiths Solveiga, Boivin Jacky, Williams Andy, Venetis Christos A, Davies Aimée, et al. The association between exaggeration in health related science news and academic press releases: retrospective observational study. BMJ 2014;349:g7015
  [2] Goldacre Ben. Preventing bad reporting on health research. BMJ 2014;349:g7465

Early Ebola Intervention

As I’ve alluded to in previous posts, I’m a big believer in being rational about charity. Ideally, one has several independent randomized controlled trials on which to decide how cost-effective an intervention is. But sometimes that just isn’t possible. Disease outbreaks are a perfect example: each one is different, and by the time one is able to study the situation, much of the damage has already been done.

It’s now believed that the current Ebola outbreak in West Africa started in December 2013. Ebola had never been seen in this part of Africa before, so there was no reason to expect it. The only way it could’ve been stopped at that point is if the entire continent of Africa had been educated on recognizing Ebola and had stockpiles of testing supplies. To say that’s unrealistic is a dramatic understatement.

The first cases were confirmed in March 2014, which is also when MSF declared an outbreak[1]. This is the point where a massive intervention would have been most efficacious. The number of cases was <100. Say each of those cases and 10 of their closest contacts were tested; that’s maybe 1,000 tests. At $100/test[2] that’s $100,000, which would’ve ended up saving at least 5,000 lives (and counting!). Even if each of those individuals needed to spend a night in quarantine, that’s likely another $100 each, so we’re up to $200,000 for 5,000 lives, or $40/life.

This assumes that the quarantine capacity already exists, and that rapid testing facilities are available. One could imagine the cost increasing 10-100x (up to $20 million). Still, $4,000/life saved seems pretty attractive. The reason it’s so cost-effective is that the intervention needed to happen early, before it could possibly be justified. If the CDC had gotten involved with the required material and personnel, and some rather harsh mandatory testing and quarantine procedures had been used, many lives could have been saved. And then people would be questioning why so much money was spent (and civil liberties violated) over something which turned out to be a non-issue. Was it worth it to buy that expensive umbrella when you didn’t end up getting wet?

Granted, I have the benefit of hindsight in the other direction. After seeing that this outbreak was so terrible, it’s easy to say that somebody should’ve done something earlier. All previous Ebola outbreaks died out after a few hundred fatalities, so throwing lots of money at this one early on could’ve seemed premature. Especially when you consider that even now, deaths caused by HIV/AIDS and malaria are on par with those caused by Ebola this year[3]. It’s difficult to prepare for some unknown, low-likelihood emergency when the day-to-day problems are so large.

Which is why the CDC, WHO, and international community should’ve gotten involved much earlier. For a long time only Doctors Without Borders was doing anything substantial[4], and they just didn’t have the resources needed. The US recently committed 3,000 troops, and up to $500 million[5]. Pay a little now or pay a lot later.

-Jacob

 

  [1] http://www.msf.org.uk/article/guinea-ebola-epidemic-declared
  [2] http://www.bostonglobe.com/opinion/2014/10/11/stop-ebola-epidemic-must-able-diagnose-quickly/LFWpKNwHTGPqfcWRyKOKqK/story.html
  [3] https://i.imgur.com/At2nqgB.png
  [4] http://newsinfo.inquirer.net/613145/doctors-without-borders-ebola-out-of-control#ixzz3IAPjJjiL
  [5] http://time.com/3380545/u-s-to-commit-500-million-deploy-3000-troops-in-ebola-fight/