More on the Bechdel Test

I gave some theoretical insights on the Bechdel test in a previous post, but silly me, of course there is real data! The Cornell Movie-Dialogs Corpus[1] contains conversations between characters in 617 movies.

Conversations in this corpus are already separated, so it’s easy to tell when two people are talking to each other. Most characters are annotated with a gender; most, but not all. For the rest, I inferred gender from the census’s list of popular boys’ and girls’ names[2], which filled in some of the gaps. All in all there were 9,035 characters: 3,027 male, 1,572 female, and 4,436 unknown. That’s a lot of unknowns, unfortunately, which means I wouldn’t trust these numbers too much on an absolute scale.
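As a rough sketch, the name-based fallback might look like the following (the helper names and list format here are illustrative, not the actual census file format):

```python
# Minimal sketch of name-based gender tagging, assuming simple lists of
# popular male and female first names.
def build_name_lookup(male_names, female_names):
    """Map first names to a gender, marking ambiguous names unknown."""
    lookup = {}
    for name in male_names:
        lookup[name.upper()] = "M"
    for name in female_names:
        # A name on both lists (e.g. PAT) is ambiguous; mark it unknown.
        lookup[name.upper()] = "?" if name.upper() in lookup else "F"
    return lookup

def infer_gender(character_name, annotated_gender, lookup):
    """Prefer the corpus annotation; fall back to the name lookup."""
    if annotated_gender in ("M", "F"):
        return annotated_gender
    first = character_name.split()[0].upper()
    return lookup.get(first, "?")
```

Ambiguous names have to stay unknown, which is one reason the unknown bucket ends up so large.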

We do have a natural comparison. The actual Bechdel test requires two women talking to each other about something other than a man. We can easily construct a male version: two men talking to each other about something other than a woman. I’ll be comparing these quantities.

Character Ratios

First, a quick pass through the data to count the number of male and female characters. I took the log2 ratio of male to female characters so that the view would be symmetric: a perfectly balanced cast sits at 0, +1 means twice as many male characters, and -1 means twice as many female.

[Figure: log2 male/female character ratio, by genre]

The overall median is a 2:1 ratio of male to female characters, and it’s remarkably consistent across genres. There is a pretty wide variance, which may be due to the incomplete gender-tagging of names in the corpus.
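For concreteness, the per-movie ratio statistic is just this (unknown-gender characters are simply left out of the count):

```python
import math
from collections import Counter

def log2_gender_ratio(genders):
    """log2(male/female) character count: 0 = balanced, +1 = twice as
    many male characters, -1 = twice as many female."""
    counts = Counter(genders)
    male, female = counts["M"], counts["F"]
    if male == 0 or female == 0:
        return None  # ratio is undefined for single-gender casts
    return math.log2(male / female)
```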

Conversations

Now the hard part. We need to identify conversations which are between two women only, and about something other than a man. I’m also doing the reverse, identifying conversations between two men which are about something other than a woman, for comparison.

Checking the gender is straightforward (it’s either annotated in the database or it’s not), and I’m only counting a conversation as passing if both characters are known to be women (or, for the male version, men). Characters with unknown gender are excluded.

Checking the topic is a bit harder. The method I’m using is simple: check the conversation for the name of any male (female) character in the same movie, as well as known male (female) pronouns. Obviously this isn’t perfect, but since I’m doing an apples-to-apples comparison between men and women, any flaws should balance out. Technically the Bechdel test only requires one passing conversation; for robustness in this analysis, I required two per movie.
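A simplified version of this filter for the female side (the tokenization and function names here are my sketch, not the actual analysis code; the male version just swaps the roles):

```python
import re

MALE_PRONOUNS = {"HE", "HIM", "HIS", "HIMSELF"}

def passes_female_version(convo, genders, male_names):
    """convo: list of (speaker, line) pairs. Passes if exactly two known
    women are talking and no line mentions a man by name or pronoun.
    male_names is an uppercase set of male character names in the movie."""
    speakers = {s for s, _ in convo}
    if len(speakers) != 2 or any(genders.get(s) != "F" for s in speakers):
        return False
    blocked = MALE_PRONOUNS | male_names
    for _, line in convo:
        tokens = set(re.findall(r"[A-Z']+", line.upper()))
        if tokens & blocked:
            return False
    return True

def movie_passes(convos, genders, male_names, min_passing=2):
    """Require at least two passing conversations per movie, per the post."""
    passing = sum(passes_female_version(c, genders, male_names) for c in convos)
    return passing >= min_passing
```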

[Figure: Number of Movies Passing Each Version]

[Figure: Fraction of Movies in Genre Passing Each Version]

The top graph shows movies by total count, the bottom shows by fraction. Nearly all movies pass at least one version. About 75% of movies (red + blue) pass the male version, while about 40% (blue + purple) pass the female version. Action and adventure movies are the most male-biased (surprise!)[3].

Romance, comedy, and horror come the closest to parity. I’m surprised about the last category; I would’ve guessed that horror would be male-dominated. And even animation had very few passing movies; won’t somebody think of the children! There were only 10 movies in that genre, though, so it may not be representative.

Looking only at movies which passed each respective test, we can see how many passing conversations existed:

[Figure: Passing conversations per movie, by genre]

This may be a bit hard to read. Blue is female, red is male, they’re side by side within each genre, and the y-axis is the number of passing conversations per movie (on a log10 scale). For the most part, movies which pass the male Bechdel test pass by a much wider margin than those passing the female version. The median number of male-passing conversations is about 40; for female it’s only 10.

That’s a 4:1 ratio, twice the 2:1 ratio we saw for characters. This is roughly what one might expect given the bias toward male characters, since the number of possible conversation pairs grows as ~(number of characters)^2. Or it could be that male characters are more prominent in the story, and hence occupy more screen time.
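The pair-counting intuition checks out numerically:

```python
from math import comb

def same_gender_pair_ratio(n_male, n_female):
    """Ratio of possible male-male to female-female conversation pairs."""
    return comb(n_male, 2) / comb(n_female, 2)

# With the observed 2:1 character ratio, e.g. 20 men and 10 women:
# C(20,2) / C(10,2) = 190 / 45, i.e. roughly a 4:1 pair ratio.
```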

Other Resources

bechdeltest.com has an enormous manually curated list of movies and their passing status. This post also has some excellent visualizations, based on a much larger set of movies. And near and dear to my heart, there’s an analysis of every Star Trek episode on The Mary Sue Blog.

-Jacob

[1] Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (CMCL ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 76-87. http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
[2] https://catalog.data.gov/dataset/names-from-census-1990
[3] Neither of the modern Tomb Raider movies passes (according to bechdeltest.com), despite starring a woman, because she’s the only one.

Some musings on statistics

A) Beware of The Wrong Summary Statistics

SlateStarCodex had a pretty interesting post entitled “Beware of Summary Statistics“, showing how they can be misleading. This isn’t exactly new; there are famous examples of how just looking at the mean and standard deviation greatly oversimplifies: distributions can have the exact same mean and standard deviation but be very different[1]. The main takeaway is to always visualize your data.

If you know the distribution in question, though, there is probably a good summary statistic for it. The go-to in social science is the Pearson correlation. SSC gave an example of two variables which appeared to be correlated, but whose correlation was highly misleading. Here are two “uncorrelated” variables:

[Figure: two “uncorrelated” variables with a clear sinusoidal trend]

The linear fit shows that X and Y are uncorrelated; the Pearson correlation is nearly 0. However, that is obviously BS, as there is a clear trend, just not a linear one, which is all Pearson correlation captures. With the benefit of viewing the data[2], we can instead correlate inverse-sin(Y) against X (orange). That correlation is 0.99. The real relationship is Y = A·sin(fX), where A = 1 and f = 1. A mean/standard deviation for this would be meaningless, but amplitude/frequency describe it perfectly.
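This is easy to reproduce with synthetic data. The sketch below assumes Y = sin(X) with X uniform on [0, π], a range where x and sin(x) happen to be exactly uncorrelated, and correlates against sin(x) rather than arcsin(y) to sidestep arcsin’s domain problems when noise pushes |y| above 1:

```python
import numpy as np

# Synthetic version of the rigged example: Y = A*sin(f*X), A = f = 1,
# X uniform on [0, pi], plus a little measurement noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, np.pi, 500)
y = np.sin(x) + rng.normal(0.0, 0.05, 500)

# A straight-line fit sees nothing: Pearson r is near zero.
r_linear = np.corrcoef(x, y)[0, 1]

# Correlating against the right functional form recovers the
# relationship almost perfectly.
r_sine = np.corrcoef(np.sin(x), y)[0, 1]
```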

Of course this is a rigged example; I generated the data from a sine wave. In the real world, one sometimes knows (or has some idea of) what shape the distribution will be. If one doesn’t, visualize it and figure it out.

B) Exact Wording Matters

The most famous example I know of is an old study by the Gates Foundation showing that the best schools are small schools. So obviously we need to look at small schools and see why they’re so great, right? Well, no, because the worst schools are also small schools. Small school -> small sample size -> high variance, meaning the outliers are always going to be found in smaller sample sizes:

[Figure: grade change vs. school size]

Source: The Promise and Pitfalls of Using Imprecise School Accountability Measures, Thomas J. Kane and Douglas O. Staiger.[3]

One of the earliest papers on cognitive biases looked at this[4]. The authors asked people whether large hospitals or small hospitals are more likely to have more days where >60% of the babies born that day were male. Most people said they’re equally likely, because the odds of being born male are the same for any particular baby in either case. But pay closer attention to the wording: it wasn’t about the overall average, it was about the variance. A simpler example: if you flip two quarters at a time, occasionally they’ll both come out heads. If you flip 10 quarters at a time, very rarely will they all be heads.
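The quarters example, computed exactly:

```python
from math import comb

def prob_all_heads(n):
    """Chance that n fair coins all land heads in one flip."""
    return 0.5 ** n

def prob_at_least_60pct_heads(n):
    """Chance that >= 60% of n fair flips are heads (the hospital question)."""
    k_min = -(-6 * n // 10)  # ceil(0.6 * n)
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

# prob_all_heads(2) = 0.25 while prob_all_heads(10) ~ 0.001, and
# likewise a 10-birth hospital has >=60%-male days far more often
# than a 100-birth hospital: smaller samples, bigger swings.
```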

C) Confounders and Conditional (In)dependence

I love Simpson’s Paradox. Trends which exist in aggregated data can reverse direction when the data is broken into subgroups. In the most general case, if subgroups exist, a trend which applies to the aggregate doesn’t have to exist in the subgroups, and if it does, it doesn’t have to be in the same direction; the same holds going the other way, from subgroup to overall.

[Figure: Y vs. X, with subgroups S1 and S2]

In the above chart, Y has an overall linear trend against X. But once it’s known whether a point is in S1 or S2, the dependence goes away; Y is conditionally independent of X given the subgroup. Interpretation depends on the problem. If the difference between S1 and S2 is something we care about, it’s interesting and we publish a paper. Champagne for everybody! If not, it’s a confounder (boo! hiss!).
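A quick synthetic illustration of the same structure (two flat subgroups, offset from each other, producing an aggregate trend):

```python
import numpy as np

rng = np.random.default_rng(1)

# Within each subgroup, Y is flat in X (conditionally independent),
# but S2 has both higher X and higher Y, creating an aggregate trend.
x1 = rng.uniform(0.0, 1.0, 200)
y1 = 0.0 + rng.normal(0.0, 0.1, 200)   # subgroup S1
x2 = rng.uniform(1.0, 2.0, 200)
y2 = 1.0 + rng.normal(0.0, 0.1, 200)   # subgroup S2

x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])

r_overall = np.corrcoef(x, y)[0, 1]    # strong positive trend
r_within = np.corrcoef(x1, y1)[0, 1]   # near zero inside a subgroup
```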

The easiest way to deal with confounders is to analyze groups separately. Say you’re interested in discovering whether people who walk fast spend more on shoes. Age affects walking speed, so to remove that confounder, one could stratify subjects into different age groups. Confounder removed! It’s a good idea, but it has two serious drawbacks:

1. Each group has a smaller sample size, which increases the variance.

2. Testing multiple groups means testing multiple hypotheses.

These errors compound each other: several smaller sample sizes mean larger variance, so the odds of getting at least one false positive get much larger (see section B)[5]. The social science studies I read never correct for multiple hypotheses; gee, I wonder why :-).
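The false-positive inflation is easy to quantify under the assumption of independent tests:

```python
def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Ten age strata at the usual alpha = 0.05 gives roughly a 40% chance
# of at least one spurious "finding". A Bonferroni correction would
# instead test each stratum at alpha / n_tests.
```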

Closing Thought

While finishing this post I came across an article about a deliberate scientific “fraud”. The authors did the experiment they described and didn’t make up any data; the only thing which makes this fraud different from so many others is that the authors are publicly saying the result is bullshit. I almost typed “the authors *knew* the result was bullshit”, except I’m sure most other snake-oil salesmen know that too. Life is complicated, so don’t trust anybody selling easy answers.

-Jacob

 

[1] e.g. Anscombe’s Quartet. http://en.wikipedia.org/wiki/Anscombe%27s_quartet
[2] …and that I generated it.
[3] Journal of Economic Perspectives, Volume 16, Number 4, Fall 2002, pages 91-114. Figure 2. http://pubs.aeaweb.org/doi/pdfplus/10.1257/089533002320950993
[4] Judgment under Uncertainty: Heuristics and Biases. Amos Tversky and Daniel Kahneman. Science, New Series, Vol. 185, No. 4157 (Sep. 27, 1974), pp. 1124-1131. http://psiexp.ss.uci.edu/research/teaching/Tversky_Kahneman_1974.pdf
[5] SSC calls this the “Elderly Hispanic Woman Effect”.

Subreddit Map

Reddit describes itself as the “front page of the internet”, and given how many users it has, that’s not too far off. It’s divided into subreddits, which can have either broad or narrow topics. These subreddits are (mostly) user-created, with the admins only occasionally stepping in to remove them. Thus, subreddits represent an “organic” set of topics on social media.

There have been a few subreddit maps created before, like Vizit[1], which was based on cross-posts[2]. Here I’m interested in measuring the overlap of users; that is, how many users different subreddits have in common. (Correction: I originally thought redditviz[3] was based on crossposts, but it’s not, it’s based on users, so check that out for a larger version of the same idea.) This presented some practical difficulties because scraping comments is a lot more demanding than scraping posts; I started with comments for 2,000 subreddits. After removing low-weight edges to cut noise, and removing isolated subreddits, I ended up with about 900.

The full map can be viewed here

The networks (pre- and post- filtering) are available here.


[1] Vizit. http://redditstuff.github.io/sna/vizit/
[2] Where the same link is posted to multiple subreddits.
[3] Redditviz. http://arxiv.org/abs/1312.3387 http://rhiever.github.io/redditviz/

Exaggeration of Science

Communicating scientific results to the public is difficult, even with the best intentions. There are all kinds of subtleties in any study which don’t make it into media coverage. Furthermore, caveats about interpreting results get lost along the way. A recent study, “The association between exaggeration in health related science news and academic press releases: retrospective observational study”[1], looked at the correlation between exaggerated claims in press releases and subsequent media coverage. As part of that study, the authors examined the media coverage of about 500 health-related articles, as well as the press releases put out by the universities themselves.

It occurred to me that this dataset has another potential use: one can look at just the press releases. This removes the element of the media, and focuses on how scientific institutions themselves are (mis)representing their work. That’s what I did here. Spoiler alert: the problem is systemic, and I didn’t see any specific villains.

And lest I be accused of exaggeration myself, I should point out some major limitations. First and foremost, I’m relying on the coding that was done by the paper above. Second, my author-based results are based on web scraping, and there are likely at least a few errors (a small number of mistakes won’t affect the overall statistics, but it does mean one should double-check before picking on a specific author). And lastly, all I’m measuring here is the correlation between universities/authors and exaggerated press releases. As Goldacre pointed out[2], press releases don’t have listed authors, so we can’t know who exactly is responsible for writing them; we certainly can’t know whether misleading statements were intentional or unintentional.


[1] Sumner Petroc, Vivian-Griffiths Solveiga, Boivin Jacky, Williams Andy, Venetis Christos A, Davies Aimée, et al. The association between exaggeration in health related science news and academic press releases: retrospective observational study.
[2] Goldacre Ben. Preventing bad reporting on health research.

Early Ebola Intervention

As I’ve alluded to in previous posts, I’m a big believer in being rational about charity. Ideally, one would have several independent randomized controlled trials on which to decide how cost-effective an intervention is. But sometimes that just isn’t possible. Disease outbreaks are a perfect example: each one is different, and by the time one is able to study the situation, much of the damage has already been done.

It’s now believed that the current Ebola outbreak in West Africa started in December 2013. Ebola had never been seen in this part of Africa before, so there was no reason to expect it. The only way it could’ve been stopped at that point is if the entire continent of Africa had been educated on recognizing Ebola, and had stockpiles of testing supplies. To say that’s unrealistic is a dramatic understatement.

The first cases were confirmed in March 2014, which is when MSF declared an outbreak[1]. This is the point where a massive intervention would have been most efficacious. The number of cases was <100. Say each of those cases, plus 10 of their closest contacts, were tested: maybe 1,000 tests. At $100/test[2] that’s $100,000, which would’ve ended up saving at least 5,000 lives (and counting!). Even if each of those individuals needed to spend a night in quarantine, that’s likely another $100 each, so we’re up to $200,000 for 5,000 lives, or $40/life.

This assumes that the quarantine capacity already exists, and that rapid testing facilities are available. One could imagine the cost increasing 10-100x (up to $20 million). Still, $4,000/life saved seems pretty attractive. The reason it’s so cost-effective is that the intervention needed to happen early, before it could possibly be justified. If the CDC had gotten involved with the required material and personnel, and some rather harsh mandatory testing and quarantine procedures were used, many lives could have been saved. And then people would be questioning why so much money was spent (and civil liberties violated) over something which turned out to be a non-issue. Was it worth it to buy that expensive umbrella when you didn’t end up getting wet?
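The back-of-envelope numbers above, as a sanity check (all inputs are the rough estimates from the text, not real budget figures):

```python
def cost_per_life(test_cost, quarantine_cost, n_people, lives_saved, overhead=1.0):
    """Back-of-envelope cost per life saved for early test-and-quarantine.
    The overhead factor covers building capacity from scratch."""
    total = (test_cost + quarantine_cost) * n_people * overhead
    return total / lives_saved

# Base case: ($100 + $100) * 1,000 people / 5,000 lives = $40/life.
# With a 100x markup for building capacity from scratch: $4,000/life.
```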

Granted, I have the benefit of hindsight in the other direction. After seeing that this outbreak was so terrible, it’s easy to say that somebody should’ve done something earlier. All previous Ebola virus outbreaks died out after a few hundred fatalities; so throwing lots of money at it early on could’ve seemed premature. Especially when you consider that even now, deaths caused by HIV/AIDS and Malaria are on par with those caused by Ebola this year [3]. It’s difficult to prepare for some unknown, low-likelihood emergency when the day-to-day problems are so large.

Which is why the CDC, WHO, and the international community should’ve gotten involved much earlier. For a long time only Doctors Without Borders was doing anything substantial[4], and they just didn’t have the resources needed. The US recently committed 3,000 troops and up to $500 million[5]. Pay a little now or pay a lot later.

-Jacob

 

[1] http://www.msf.org.uk/article/guinea-ebola-epidemic-declared
[2] http://www.bostonglobe.com/opinion/2014/10/11/stop-ebola-epidemic-must-able-diagnose-quickly/LFWpKNwHTGPqfcWRyKOKqK/story.html
[3] https://i.imgur.com/At2nqgB.png
[4] http://newsinfo.inquirer.net/613145/doctors-without-borders-ebola-out-of-control#ixzz3IAPjJjiL
[5] http://time.com/3380545/u-s-to-commit-500-million-deploy-3000-troops-in-ebola-fight/