The comments of the few outweigh the comments of the many

The Pareto Principle for businesses states that 80% of sales come from 20% of customers. Social media has the same skew: the majority of content comes from a minority of users. I’ve always been curious just how skewed this activity can be. In particular, the skew won’t be the same across different forums. Reddit provides a natural opportunity to measure this skew: there are an enormous number of subreddits with varying levels of activity. Subreddits also provide straightforward topic association, so we can see which topics tend to be dominated by fewer individuals.

Dataset

I used the corpus of reddit comments collected by /u/Stuck_In_the_Matrix[1]. To keep things manageable I only used one month of data, January 2015. This corpus includes 53,851,542 total comments from 2,512,122 unique usernames, and (allegedly) represents every public comment on reddit during that time. For simplicity I’ll use the terms username/user/author interchangeably, although strictly speaking a user may have many usernames. I excluded comments which were deleted, as well as comments by any of about a dozen highly active bots (AutoModerator, autowikibot, etc.). Only subreddits with at least 1,000 users were included.
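For concreteness, here’s a minimal sketch of that filtering step. It assumes the corpus is newline-delimited JSON with author and subreddit fields, as the dump was distributed; the file path and the short bot list are placeholders, not the exact pipeline used.

```python
import json
from collections import Counter, defaultdict

EXCLUDED_BOTS = {"AutoModerator", "autowikibot"}  # plus the rest of the ~dozen excluded bots

# subreddit -> Counter mapping each author to their comment count
comments_per_author = defaultdict(Counter)

with open("RC_2015-01") as f:  # placeholder path to the January 2015 dump
    for line in f:
        comment = json.loads(line)
        author = comment["author"]
        if author == "[deleted]" or author in EXCLUDED_BOTS:
            continue  # skip comments whose author shows as deleted, and known bots
        comments_per_author[comment["subreddit"]][author] += 1

# Keep only subreddits with at least 1,000 distinct commenters
subs = {s: c for s, c in comments_per_author.items() if len(c) >= 1000}
```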

Totals: 1,322 subreddits. 2,512,103 users. 42,033,578 comments. Woot!

The metric I’ll be presenting here is the Gini coefficient. It was developed for measuring wealth inequality, and can be applied to any frequency distribution. It takes a value of 0 for a perfectly equal distribution, and 1 for completely unequal (1 person responsible for all the wealth/comments, none by anybody else). The full processed statistics are available in a fusion table here, including some other statistics not discussed in this post.
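The post doesn’t show the computation, but the statistic is simple; here’s a minimal sketch (illustrative code, not the exact pipeline) of computing the Gini coefficient from one subreddit’s comments-per-user counts.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of non-negative counts: 0 = perfectly equal, ~1 = maximally unequal."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Standard rank-based formula on the sorted values
    return (2 * np.sum(ranks * x)) / (n * x.sum()) - (n + 1) / n

print(gini([1, 1, 1, 1, 1]))    # 0.0: every user comments equally
print(gini([0, 0, 0, 0, 100]))  # 0.8: one user makes all the comments (max is (n-1)/n)
```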

Results

The distribution of Gini coefficients across subreddits is shown below. The average value is 0.59, with most subreddits falling between 0.4 and 0.8.

[Figure: histogram of Gini coefficients across subreddits]

I was a little surprised by this plot, mainly by the spread. An average Gini of 0.59 seems reasonable, and that’s already an intense skew; for reference, the Gini coefficient of US income is 0.47. But the spread is very wide: some subreddits are highly concentrated, and some are much more egalitarian.

We can also look at the most and least egalitarian subreddits. Here are the top and bottom 50 subreddits by Gini index:

[Table: top and bottom 50 subreddits by Gini coefficient]

The low-Gini category seems to be mostly pictures: /r/gif, /r/AnimalsBeingBros, /r/AnimalsBeingJerks, /r/HumanPorn (for human portraits). /r/millionairemakers is an interesting sub; a random winner is chosen and subscribers are asked to donate $1, with the hope of making one person a millionaire. They haven’t made any millionaires, but they’ve made some people a few thousand dollars. On the high-Gini end we see sports-related subs (/r/nfl, /r/BostonBruins) and some other entertainment subs: /r/RWBY is an anime web series, and /r/MLPLounge and /r/mylittlepony are both present. Side note: this might be the first time I’ve seen /r/Coontown (racist) and /r/GamerGhazi (anti-gamergate / pro-social-justice) so close to each other[2].

Putting these together, it seems like more casual subreddits have the lowest Gini. Nobody is super-passionate about /r/Eyebleach; you just go to look at pretty pictures. The high-Gini subs have topics which people get a bit more passionate about: their favorite show, sports team, video game, etc. There are exceptions; /r/arresteddevelopment is a low-Gini subreddit, for instance. A small core of extremely passionate individuals is what makes a high-Gini environment. I’m sure many users on /r/cigars just want a few tips, but I’m equally sure some people are mega-serious about getting the best damn cigar they can.

Caveats

Since this is a complete dataset there shouldn’t be much in the way of selection biases. There were 3 million deleted comments, which represent ~6% of the total, not a huge amount. Also, there is no way to link usernames to people, so interpretation gets a little complicated: without knowing how many alternate accounts exist, or the commenting behavior on them, it’s hard to know how inequality in comments-per-username translates into inequality in comments-per-real-person.

Also, I excluded the most prolific bots, but there are likely some I missed. A prolific bot targeted at a specific subreddit will have very high activity and could cause an artificially high Gini index.

Shout-Outs

Largest Number of Subreddits: /u/codex561, who commented in 1,109 different subreddits. Way to go![3]

Largest Number of Comments: /u/Doctor-Kitten, who commented an astonishing 13,380 times, more than any other non-bot user!

Highest Total Comment Karma: /u/PainMatrix, who commented 1,361 times, achieving a total score (net up and down) of 187,374!

-Jacob

  1. [1]Original reddit post: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/. Internet archive link: https://archive.org/details/2015_reddit_comments_corpus
  2. [2]/r/goodyearwelt I’m sorry to have put you in the middle of these two, but it wasn’t my fault, it was the math!
  3. [3] Honorable mention to /u/lolhaibai at 2,252 subreddits, but who was disqualified from these pointless awards because they deleted their account. And dishonorable mention to /u/VeryAwesome69, who had activity on 1,116 subreddits, each and every comment being the question “Ever had a dream where all your friends were dinosaurs?”. I have not.

More on the Bechdel Test

I gave some theoretical insights on the Bechdel test in a previous post, but silly me, of course there is real data! The Cornell Movie-Dialogs Corpus[1] contains conversations between characters in 617 movies.

Conversations in this corpus are already separated, so it’s easy to tell when two people are talking to each other. Most characters are annotated with a gender. Most, but not all. For the rest, I inferred gender based on the census’s lists of popular boys’ and girls’ names[2], which added some more information. All in all there were 9,035 characters: 3,027 male, 1,572 female, and 4,436 unknown. Lots of unknowns, unfortunately, which means I wouldn’t trust these numbers too much on an absolute scale.
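Here’s a rough sketch of that name-based inference (not the exact code used). It assumes the census male and female first-name lists have been downloaded locally, with the name as the first whitespace-separated field on each line; the file paths are placeholders.

```python
def load_names(path):
    """Read a census name list; the name is the first field on each line."""
    with open(path) as f:
        return {line.split()[0].upper() for line in f if line.strip()}

male_names = load_names("dist.male.first")      # placeholder local paths
female_names = load_names("dist.female.first")

# Names appearing on both lists (e.g. LESLIE) are ambiguous; leave them unknown.
ambiguous = male_names & female_names
name_to_gender = {n: "M" for n in male_names - ambiguous}
name_to_gender.update({n: "F" for n in female_names - ambiguous})

def infer_gender(character_name):
    first = character_name.strip().upper().split()[0]
    return name_to_gender.get(first, "?")  # "?" = still unknown
```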

We do have a natural comparison. The actual Bechdel test requires two women talking to each other about something other than a man. We can easily construct a male version: two men talking to each other about something other than a woman. I’ll be comparing these quantities.

Character Ratios

First a quick pass through to count the number of male/female characters. I took the log2 ratio of male/female characters so that the view would be symmetric. A perfectly balanced cast would be at 0, +1 means twice as many male characters, -1 means twice as many female.

[Figure: distribution of log2(male:female) character ratios by genre]

The overall median is a 2:1 ratio of male:female characters, and it’s remarkably consistent across genres. There is a pretty wide variance, which may be due to the incomplete gender-tagging of names in the corpus.

Conversations

Now the hard part. We need to identify conversations which are between two women only, and about something other than a man. I’m also doing the reverse, identifying conversations between two men which are about something other than a woman, for comparison.

Checking the gender is straightforward (it’s either annotated in the database or it’s not), and I’m only counting conversations that pass if both characters are KNOWN to be women (men). So characters with unknown gender are excluded.

Checking the topic is a bit harder. The method I’m using is simple: check for the presence of a male (female) character name from the same movie in the conversation, as well as known male (female) pronouns. Obviously this isn’t perfect, but since I’m doing an apples-to-apples comparison between men and women, any flaws should balance out. Technically the Bechdel test only requires one passing conversation; for robustness, in this analysis I required two per movie.
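A minimal sketch of that check, for the female version (the male version is the mirror image). The conversation lines, speaker genders, and the male character names in the movie are assumed to come from the corpus; the helper names here are illustrative, not the original code.

```python
import re

MALE_PRONOUNS = {"he", "him", "his", "himself"}

def mentions_any(text, words):
    """True if any of the given lowercase words appears as a whole word in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return bool(tokens & words)

def passes_female_version(lines, speaker_genders, male_character_names):
    """Two known women talking, with no male pronoun or male character name mentioned."""
    if len(speaker_genders) != 2 or any(g != "F" for g in speaker_genders):
        return False
    text = " ".join(lines)
    # Split multi-word names so "JOHN SMITH" matches on either token
    male_names = {w for n in male_character_names for w in n.lower().split()}
    return not mentions_any(text, MALE_PRONOUNS | male_names)
```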

[Figure: Number of Movies Passing Each Version]

[Figure: Fraction of Movies in Genre Passing Each Version]

The top graph shows movies by total count, the bottom by fraction. Nearly all movies pass at least one version. About 75% of movies (red + blue) pass the male version, while about 40% (blue + purple) pass the female version. Action and adventure movies are the most male-biased (surprise!).[3]

Romance, comedy, and horror come the closest to parity. I’m surprised by that last category; I would’ve guessed that horror would be male-dominated. And even animation had very few movies passing; won’t somebody think of the children! There were only 10 movies in that genre, though, so it may not be representative.

Looking only at movies which passed each respective test, we can see how many passing conversations existed:

[Figure: number of passing conversations per movie, by genre and gender]

This may be a bit hard to read. Blue is female, red is male, they’re next to each other by genre, and the y-axis is the number of passing conversations per movie (on a log10 scale). For the most part, movies which pass the male Bechdel test do so with far more conversations than movies which pass the female version. The median number of male-passing conversations is about 40; for female it’s only 10.

That’s a 4:1 ratio, twice the 2:1 ratio we saw for characters. That’s roughly what one might expect given the bias toward male characters, since the number of possible conversation pairs scales as ~(number of characters)^2. Or it could be that the male characters are more prominent in the story, and hence occupy more screen time.
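To make that counting argument concrete: if a movie has roughly 2m male characters and m female characters, the numbers of possible same-gender pairs are

```latex
\binom{2m}{2} = \frac{2m(2m-1)}{2} \approx 2m^{2}
\qquad \text{vs.} \qquad
\binom{m}{2} = \frac{m(m-1)}{2} \approx \frac{m^{2}}{2},
```

which is roughly a 4:1 ratio of male-male to female-female pairs, in line with the 4:1 ratio of passing conversations (ignoring, of course, that not every possible pair actually talks).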

Other Resources

bechdeltest.com has an enormous manually curated list of movies and their passing status. This post also has some excellent visualizations, based on a much larger set of movies. And near and dear to my heart, there’s an analysis of every Star Trek episode on The Mary Sue Blog.

-Jacob

  1. [1]Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (CMCL ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 76-87. http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
  2. [2]https://catalog.data.gov/dataset/names-from-census-1990
  3. [3]Neither of the modern Tomb Raider movies passes (according to bechdeltest.com), despite starring a woman, because she’s the only one.

Some musings on statistics

A) Beware of The Wrong Summary Statistics

SlateStarCodex had a pretty interesting post entitled “Beware of Summary Statistics”, showing how they can be misleading. This isn’t exactly new; there are famous examples of how just looking at the mean and standard deviation greatly oversimplifies: distributions can have the exact same mean and standard deviation but be very different[1]. The main lesson to take away is to always visualize your data.

If you know the distribution in question, though, there is probably a good summary statistic for it. The go-to in social science is the Pearson correlation. SSC gave an example of two variables which appeared to be correlated, but where that correlation was highly misleading. Here are two “uncorrelated” variables:

[Figure: two “uncorrelated” variables following a sine wave, with a flat linear fit]

The linear fit shows that X and Y are uncorrelated; the Pearson correlation is nearly 0. However, that is obviously BS, as there is a clear trend; it’s just not linear, which is what Pearson correlation captures. With the benefit of viewing the data[2], we can correlate X vs. inverse-sin(Y) (orange). That correlation is 0.99. The real relationship is Y = A·sin(fX), where A = 1 and f = 1. A mean/standard deviation for this would be meaningless, but amplitude/frequency would describe it perfectly.
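Here’s a small sketch reproducing that behavior with synthetic data in the same spirit as the plot (not the original numbers, and linearizing via sin(X) rather than the arcsin transform, but the point is the same):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
x = np.linspace(0, 20 * np.pi, 2000)                      # ten full periods
y = 1.0 * np.sin(1.0 * x) + rng.normal(0, 0.05, x.size)   # Y = A*sin(fX), A = f = 1, plus a little noise

r_linear, _ = pearsonr(x, y)          # near 0: no *linear* relationship
r_shaped, _ = pearsonr(np.sin(x), y)  # ~0.99 once the right functional form is used
print(f"Pearson r(X, Y)     = {r_linear:.2f}")
print(f"Pearson r(sin X, Y) = {r_shaped:.2f}")
```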

Of course this is a rigged example, and I generated the data from a sine wave. In a real-world example, one sometimes knows (has some idea) what shape the distribution will be. If one doesn’t, visualize it and figure it out.

B) Exact Wording Matters

The most famous example I know of is an old study by the Gates Foundation showing that the best schools are small schools. So obviously we need to look at small schools and see why they’re so great, right? Well, no, because the worst schools are also small schools. Small school -> small sample size -> high variance, meaning the outliers are always going to be found in smaller sample sizes:

[Figure: grade change vs. school size]
Source: Thomas J. Kane and Douglas O. Staiger, “The Promise and Pitfalls of Using Imprecise School Accountability Measures”.[3]
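A quick simulation makes the point. Every school below draws its students’ score changes from the same distribution, so any school-level differences are pure sampling noise (all of the numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = rng.integers(20, 2000, size=1000)                          # hypothetical enrollments
means = np.array([rng.normal(0, 15, size=n).mean() for n in sizes])

order = np.argsort(means)
print("enrollment at the 10 'worst' schools:", sizes[order[:10]])
print("enrollment at the 10 'best' schools: ", sizes[order[-10:]])
# Both lists are dominated by small schools: smaller n means noisier means.
```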

One of the earliest papers on cognitive biases looked at this[4]: they asked people whether large hospitals or small hospitals are more likely to have more days where >60% of babies born that day were male. Most people said they’re the same, because the odds of being born male are the same for any particular baby in either case. But pay closer attention to the wording; it wasn’t about the overall average, it was about the variance. Simpler example: if you flip two quarters at a time, occasionally they’ll all (i.e., both) come up heads. If you flip 10 quarters at a time, very rarely will they all be heads.

C) Confounders and Conditional (In)dependence

I love Simpson’s Paradox. Trends which exist in aggregated data can reverse direction when the data are broken into subgroups. In the most general case, if subgroups exist, a trend which applies to the aggregate doesn’t have to exist in the subgroups, and if it does, it doesn’t have to be in the same direction. And vice versa going the other direction, from subgroup to overall.

[Figure: an overall X-Y trend that vanishes within subgroups S1 and S2]

In the above chart, Y has an overall linear trend against X. But once it’s known whether a point is in S1 or S2, the dependence goes away, so Y is conditionally independent of X given the subgroup. Interpretation will depend on the problem situation. If the difference between S1 and S2 is something we care about, it’s interesting and we publish a paper. Champagne for everybody! If not, it’s a confounder (boo! hiss!).
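A tiny synthetic version of that chart (made-up numbers): within each subgroup Y doesn’t depend on X at all, but the groups have different baselines, so the pooled data show a strong trend.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

x1, y1 = rng.normal(0, 1, 500), rng.normal(5, 1, 500)    # subgroup S1
x2, y2 = rng.normal(3, 1, 500), rng.normal(10, 1, 500)   # subgroup S2

x, y = np.concatenate([x1, x2]), np.concatenate([y1, y2])

print("overall r :", round(pearsonr(x, y)[0], 2))    # strong apparent trend (~0.8)
print("within S1 :", round(pearsonr(x1, y1)[0], 2))  # ~0
print("within S2 :", round(pearsonr(x2, y2)[0], 2))  # ~0
```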

The easiest way to deal with confounders is to analyze groups separately. Say you’re interested in discovering whether people who walk fast spend more on shoes. Well, age affects walking speed, so to remove that confounder, one could stratify people into different age groups. Confounder removed! It’s a good idea, but it has two serious drawbacks:

1. Each group has a smaller sample size, which increases the variance.

2. Testing multiple groups means testing multiple hypotheses.

These errors compound each other. We’ve got several smaller sample sizes, meaning the variance is larger, so the odds of getting at least one false positive get much larger (see section B)[5]. The social science studies I read never correct for multiple hypotheses; gee, I wonder why :-).
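Back-of-the-envelope, assuming independent tests at the usual 0.05 threshold:

```python
# Chance of at least one false positive across k independent tests at alpha = 0.05
alpha = 0.05
for k in (1, 5, 10, 20):
    print(f"{k:2d} subgroups -> {1 - (1 - alpha) ** k:.0%} chance of a spurious 'finding'")
```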

Closing Thought

While finishing this post I came across an article about a deliberate scientific “fraud”. The authors did the experiment they said they did and didn’t make up any data; the only thing which makes this fraud different from so many others is that the authors are publicly saying the result is bullshit. I almost typed “the authors *knew* the result is bullshit”, except I’m sure most other snake-oil salesmen know that too. Life is complicated, so don’t trust anybody selling easy answers.

-Jacob

 

  1. [1]e.g. Anscombe’s Quartet. http://en.wikipedia.org/wiki/Anscombe%27s_quartet
  2. [2]and that I generated it
  3. [3]  Journal of Economic Perspectives—Volume 16, Number 4—Fall 2002—Pages 91–114.  Figure 2. http://pubs.aeaweb.org/doi/pdfplus/10.1257/089533002320950993
  4. [4]Judgment under Uncertainty: Heuristics and Biases. Amos Tversky; Daniel Kahneman
    Science, New Series, Vol. 185, No. 4157. (Sep. 27, 1974), pp. 1124-1131. http://psiexp.ss.uci.edu/research/teaching/Tversky_Kahneman_1974.pdf 
  5. [5]SSC calls this the “Elderly Hispanic Woman Effect”

Subreddit Map

Reddit describes itself as the “front page of the internet”, and given how many users it has, that’s not too far off. It’s divided into subreddits, which can have either broad or narrow topics. These subreddits are (mostly) user-created, with the admins only occasionally stepping in to remove them. Thus, subreddits represent an “organic” set of topics on social media.

There have been a few subreddit maps created before, like Vizit[1], which was based on cross-posts[2]. Here I’m interested in measuring overlap of users; that is, how many users different subreddits have in common. (Correction: I originally thought redditviz[3] was based on crossposts, but it’s not, it’s based on users, so check that out for a larger version of the same idea.) This presented some practical difficulties because scraping comments is a lot more demanding than scraping posts; I started with comments from 2,000 subreddits. After removing low-weight edges to cut noise, and removing isolated subreddits, I ended up with about 900.
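The post doesn’t spell out the overlap measure, so here’s one plausible sketch using Jaccard similarity between commenter sets; the edge-weight cutoff below is made up for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Fraction of users shared between two sets of commenters."""
    return len(a & b) / len(a | b)

def build_edges(commenters, min_weight=0.005):
    """commenters: dict mapping subreddit -> set of usernames who commented there."""
    edges = {}
    for s1, s2 in combinations(commenters, 2):
        w = jaccard(commenters[s1], commenters[s2])
        if w >= min_weight:   # drop low-weight edges to cut noise
            edges[(s1, s2)] = w
    return edges
```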

The full map can be viewed here

The networks (pre- and post- filtering) are available here.


  1. [1]Vizit. http://redditstuff.github.io/sna/vizit/
  2. [2]Where the same link is posted to multiple subreddits
  3. [3]Redditviz. http://arxiv.org/abs/1312.3387 http://rhiever.github.io/redditviz/

Exaggeration of Science

Communicating scientific results to the public is difficult, even with the best intentions. There are all kinds of subtleties in any study which don’t make it into media coverage. Furthermore, caveats about interpreting results get lost along the way. A recent study, “The association between exaggeration in health related science news and academic press releases: retrospective observational study”[1], looked at the correlation between exaggerated claims in press releases and subsequent media coverage. As part of that study, the authors examined the media coverage of about 500 health-related articles, as well as the press releases put out by the universities themselves.

It occurred to me that this dataset has another potential use: one can look at the press releases on their own. This removes the element of the media, and focuses on how scientific institutions themselves are (mis)representing their work. That’s what I did here. Spoiler alert: the problem is systemic and I didn’t see any specific villains.

And lest I be accused of exaggeration myself, I should point out some major limitations. First and foremost, I’m relying on the coding that was done by the paper above. Second, my author-based results are based on web scraping and there are likely at least a few errors (a small number of mistakes won’t affect the overall statistics, but it does mean one should double-check before picking on a specific author). And lastly, all that I’m measuring here is the correlation between universities/authors and exaggerated press releases. As Goldacre pointed out[2], press releases don’t have listed authors, so we can’t know who exactly is responsible for writing them; we certainly can’t know if misleading statements were intentional or unintentional.


  1. [1]Sumner Petroc, Vivian-Griffiths Solveiga, Boivin Jacky, Williams Andy, Venetis Christos A, Davies Aimée, et al. The association between exaggeration in health related science news and academic press releases: retrospective observational study
  2. [2]Goldacre Ben. Preventing bad reporting on health research