Properties of angry speech

Note: This post contains profanity

Sit down if you’re standing: There’s a lot of angry speech on the internet. There’s a lot of regular speech too. For exact meaning, the order and context of words are critical, but for general tone one can get pretty far just by looking at word choice.

There were two text corpora analyzed here:

0. The “Blog Authorship Corpus” [1]

1. The text from an internet rant site.

“Rant” sites are an anonymous way for people to vent their anger online, the theory being it is cathartic to express that anger in a safe setting. It also provides a nice dataset of text which is guaranteed to have much more anger than normal.

Methods: The blog corpus had formatting stripped anyway. I lowercased all the text, removed punctuation, and simply counted all the unique words. No spelling correction or stemming was done. Words in the Python Natural Language Toolkit (nltk) “stopwords” corpus were removed.
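For the curious, here is a minimal sketch of that counting step. It is not the exact script I ran: the function name `count_words` and the tiny stopword set are stand-ins for illustration, and the real analysis used the full nltk stopword list.

```python
import string
from collections import Counter

# Stand-in stopword set (an assumption for this sketch). The post used NLTK's
# "stopwords" corpus, which is available via
# nltk.corpus.stopwords.words("english") once the corpus has been downloaded.
STOPWORDS = {"i", "the", "a", "and", "to", "of", "is", "it", "this", "so"}

def count_words(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords, count."""
    # string.punctuation only covers ASCII, so curly quotes would survive this step.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(word for word in cleaned.split() if word not in stopwords)

counts = count_words("I hate this. I HATE it so much!")
print(counts.most_common(20))  # top words, most frequent first
```

Since there is no stemming or spelling correction, “hate” and “hates” are counted as separate words.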

Wordclouds from each are shown below:

Word Cloud from Blog Corpus

Word Cloud from Rant Site

It’s a bit hard to interpret these, so here are bar charts from the top 20 words:

Top 20 words from Blog Corpus

Top 20 Words from Rant Site Corpus

See the difference? It took me a minute too. For the most part they are very similar. “fuck”, “fucking”, and “hate” show up in the rant corpus top 20 and not in the blog corpus top 20, leading to the unsurprising conclusion that people use these words a lot more when they’re angry.

To better illustrate the difference, I ranked each word, took the difference between ranks, and plotted those which had the highest difference. For instance, if “fuck” was the 10th most commonly used word in the rant site, and 200th most commonly used word in the blog corpus, it would have a difference of 190. The words with the highest rank difference are shown below.
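The rank comparison can be sketched roughly like this. The names `ranks` and `rank_differences` are made up for illustration, and two choices here are assumptions on my part: words that appear in only one corpus are skipped, and ties in frequency are broken by whatever order `Counter.most_common` returns.

```python
from collections import Counter

def ranks(counts):
    """Map each word to its frequency rank (1 = most common)."""
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def rank_differences(rant_counts, blog_counts):
    """Blog rank minus rant rank for words found in both corpora, largest first."""
    rant_rank, blog_rank = ranks(rant_counts), ranks(blog_counts)
    shared = rant_rank.keys() & blog_rank.keys()
    diffs = {word: blog_rank[word] - rant_rank[word] for word in shared}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)

# Toy counts, not real data from either corpus.
rant = Counter({"people": 5, "hate": 4, "time": 1})
blog = Counter({"people": 6, "time": 3, "hate": 1})
result = rank_differences(rant, blog)
print(result)
```

A large positive number means the word ranks much higher on the rant site than in the blog corpus.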

Difference in Rank Between Rant Corpus and Blog Corpus

The top differences are colloquialisms. “be” would be stripped out as it is a stopword, but “bee” would not. This could reflect a difference in the userbase of these two sites rather than the emotional state of the users. Both corpora contained “people” in their top twenty words, but “peeps” was much more frequently used in the rant site corpus.

Conclusions? I was surprised at how similar these two corpora were. It could be that bloggers are angrier than I give them credit for, or it could be that the hallmarks of angry speech are subtler than I expected. Or that people don’t bother to find creative ways to say “stupid fucking shit”.


  1. [1] J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.


Lifetime of Fortune 500 Companies

I often find myself discussing what the future world will look like. I might wonder out loud what will come after Facebook or Google or whatever. Frequently the response is “nothing”. As in, these companies are so huge that they must simply last forever.

Well, my quantitative brain simply will not abide that. Taking it to the extreme, I highly doubt Google will still exist in 1,000 years. It might, but I doubt it. There are some companies [1] which are seriously old, and these seem to be small-ish companies sustained by family tradition.

To examine the big company -> last long argument, I went to the Fortune 500 list. Money magazine has data from 1955 [2]. For technical reasons I only examined data from 1955 to 2007. The basic question I was looking at is how long a company that’s on the list one year will stay on the list. That plot, for various starting years, is shown below.

Number of companies still on list, by starting year

Number of companies still on list as a function of time after start

These plots are very similar; the second uses an x-axis of relative time instead of absolute so that the different years can be more easily compared. I’m not sure what happened in 1994-1995, so that massive drop might be an artifact or a change in methodology.

I should also mention that a firm changing their name, or undergoing a merger, would have the appearance of them disappearing for the purposes of this graph. So this metric is more about the changing of business landscape than whether a company still exists or not. For example, Twitter being bought by Facebook would be much different than Twitter going out of business, but both are much different than Twitter continuing to exist as an independent entity.

We see that the “half-life” of companies is about 25 years. A decent amount of time, but still much less than the average lifetime of a human being. Think about that next time somebody complains about big businesses not offering pension plans.
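Here is a rough sketch of how a survival curve and half-life like this could be computed. The data layout (a dict mapping each year to the set of company names on that year's list) and the function names are assumptions for illustration, not my actual script. Names are matched exactly, so a renamed or merged company counts as gone, and a company that drops off and later returns counts as present again.

```python
def survivors(lists_by_year, start_year):
    """For each year from start_year on, count start-year companies still listed."""
    cohort = lists_by_year[start_year]
    return {year: len(cohort & names)
            for year, names in sorted(lists_by_year.items())
            if year >= start_year}

def half_life(lists_by_year, start_year):
    """Years until at most half of the start-year cohort remains, or None if never."""
    counts = survivors(lists_by_year, start_year)
    initial = counts[start_year]
    for year, remaining in counts.items():
        if remaining <= initial / 2:
            return year - start_year
    return None

# Toy four-company "list", not real Fortune 500 data.
toy = {
    1955: {"A", "B", "C", "D"},
    1956: {"A", "B", "C", "E"},
    1957: {"A", "B", "E", "F"},
    1958: {"A", "E", "F", "G"},
}
print(survivors(toy, 1955))
print(half_life(toy, 1955))
```

Running `half_life` for each starting year gives one number per curve in the plots above.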

One might expect the larger firms to survive longer. I’m sure this is true, but if GM became the new Kia I think that counts as a big change in the business landscape. This next graph collects the rank of each company in 1955, and organizes them into quintiles, and demonstrates how long those companies remain in at least that quintile. So GM, ranked #1 in 1955, would need to stay in the top 100 (it had a rank of #3 in 2007, so it passed that test).

How many companies are at their starting (1955) quintile or higher

This methodology will slightly advantage those companies with a lower starting rank, as they just need to stay on the list and still have plenty of room for growth. However, we see that reality has a way of penalizing smaller companies. The top quintile has the longest half-life, about 27 years, while the lowest is only 5. Whether this is due to smaller companies being more likely to fail, or simply being acquired, I don’t know.
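The quintile test itself is simple enough to sketch. The helper names here are hypothetical, and I am assuming a 500-entry list split into five blocks of 100 ranks, with `None` standing for "no longer on the list".

```python
def quintile(rank, list_size=500):
    """Quintile of a 1-based rank: ranks 1-100 give 1, 101-200 give 2, and so on."""
    return (rank - 1) * 5 // list_size + 1

def holds_quintile(start_rank, later_rank):
    """True if the company is still listed at its starting quintile or higher."""
    if later_rank is None:  # dropped off the list entirely
        return False
    return quintile(later_rank) <= quintile(start_rank)

# GM: ranked #1 in 1955 and #3 in 2007, so it stays in the top quintile.
print(holds_quintile(1, 3))
```

A company starting at rank 450 only has to stay on the list at all, which is the slight advantage mentioned above.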

In summary, the market is a harsh mistress. Everybody knows that nothing lasts forever, but forever can be a pretty short time.


  1. [1] 
  2. [2]

Theoretical maximum of the Bechdel Test


For those that may not know, the Bechdel test has 3 criteria. A movie/TV show/book/whatever must have at least 1 scene where there exist:

0. 2 female characters

1. talking to each other

2. about something other than a man

A rather depressing number of movies/TV shows fail this relatively simple test. Most characters seem to be male, and the female ones exist just so that men will have love interests. Since I’m a numbers guy, I wondered what fraction of media would pass the test if there were actually no sexism.


It’s the end of the world again

In a move that surprised very few people, the world did not end today. Or yesterday. Slightly more surprising was that it didn’t end Friday, but only slightly.

Newton was apparently not a fan of the various apocalyptic predictions, and in predicting (in 1704!) that the world would not end before 2060 said:

This I mention not to assert when the time of the end shall be, but to put a stop to the rash conjectures of fanciful men who are frequently predicting the time of the end, and by doing so bring the sacred prophesies into discredit as often as their predictions fail.

Sadly, it’s only gotten worse since his time. I’m not sure what causes people to make a prediction which requires the prognosticator to die if proven right, but it seems like a popular hobby regardless. I was curious what sorts of timelines are popular, so I put together a chart based on Wikipedia’s surprisingly long list.


I’ve got the number of years between when the prediction was made and when it was supposed to happen on the X-axis, and the year of the predicted end on the Y-axis. I coded 1 day as 0.01 years, and 1 month as 0.1 years. Much of this data is ballpark, as it was too time-consuming to track down the actual year a prediction was made, so I just took the predictor’s birth year + 20 when I couldn’t quickly locate a year. The rather large number of year-2000 predictions are excluded; believe it or not, this only includes predictions up to 1999 CE. Negative numbers are BCE.

The data doesn’t cluster too well, which I thought was surprising. There are definitely more predictions post-1500, although that’s most likely a reflection of our society having more people and keeping better records. 10-100 years looks to capture a plurality of predictions. One poll found 15% of people think the world will end during their lifetime, and that was taken May 2012. I’d be curious to see how that poll would look when taken next year. One would think it would dip, as the end of the Mayan calendar came and went with nary an apocalypse to show for it. However, my guess would be that people just have a hard time believing the world will keep going for a long long time; finding an event to actually end it is a secondary concern.

There are a few common themes in predictions:

0. “<Horrible event> is a sign that the world is ending”. There are a few of these, and they tend to have short timescales. If I woke up one day and the sky was dark, I might get pretty freaked out.

Believing that whatever war was currently happening was a sign of imminent world-wide destruction is also fairly common. I sympathize; if any of the nuclear powers went to war I’d get pretty nervous. However, people often put forth the argument that the war/unrest currently being experienced is different from the thousands of wars experienced throughout history. In 2002 somebody handed me a flyer explaining that the current unrest in the Middle East was a sign of the end, which I thought was pretty funny. If war in the Middle East meant the world was ending soon it would’ve ended a few dozen times over the past century.

1. “Through complicated math I have determined that the Bible says Jesus will return on X”. These predictions seemed to have timescales ranging from a few decades to a few centuries. Jesus is apparently not a patient fellow. Many seemed to think that the world would only last 6000 years from the date of creation, so if you trace that back to about 5000 BCE (give or take) you then get when the world ends.

2. “My last prediction was wrong, but it’s gonna happen pretty soon no really this time”. My favorite category. The initial predicted date was for roughly 10-20 years after the date of prediction; revisions were usually for a couple of years out. Harold Camping made predictions of the form “rapture on this date, end of the world to follow soon thereafter”, which had the nice effect that when the rapture didn’t happen, he could just say “well maybe it’ll happen closer to the end of the world”.

If I’m around to see the end of the world, I just hope it happens quickly. I’d hate to end up living in Zion eating gruel with Neo. More importantly, many thousands of people would come out of the woodwork claiming to have predicted it, and listening to their bullshit would be incredibly painful.




Crazy things on which to spend money

Modern society is full of marvels. Computers with more storage and processing power than existed 50 years ago are available to everyone. Flying across the world may not be “cheap”, but it’s doable. We take a lot of these things for granted and don’t stop to wonder about it. If you ever find yourself with some extra money, here are some ways to get rid of it.

Play with some construction equipment

Dig This offers a “playground for adults”. I love swings and slides as much as the next guy, but operating some heavy equipment for an hour and a half is a bit more intense. Not too much, price-wise, if you’re already in Vegas, at a mere $250.

The Rosetta Disk

The 13,000 pages in the collection contain documentation on over 1500 languages gathered from archives around the world. For each language we have several categories of data—descriptions of the speech community, maps of their location(s), and information on writing systems and literacy.

Each disk is a repository of human languages which lasts many thousands of years. It can be read with only optical technology, ensuring that even if society collapses the information is not lost. Last time I talked to the people at LongNow, each one cost about $5k. Not sure if they’re still making and/or selling them, but it would be worth an email.


Getting up into space will set you back about $15 million, a bit out of my price range. The Zero G Corporation has a slightly more cost-effective option:

On our specially modified Boeing 727, parabolic arcs are performed to create a weightless environment allowing you to float, flip and soar as if you were in space.

For the low low price of $4950 (+5% tax), 8 minutes of weightlessness can be yours.

Whole genome sequencing

As far as I know, there aren’t any companies out there offering this as a product. But they do it as part of studies, so with a few phone calls one could likely get their entire genome sequenced. According to some coworkers, the cost would be around $5k. If one really wants genetic information but is hard up for cash, there are cheaper options:
0. 23andMe. What they do isn’t whole genome sequencing, rather they look for the presence of common mutations. Fast, convenient, and only $100.
1. The Personal Genome Project. Your entire genome is sequenced, and other medical data collected (including your medical records), to advance medical knowledge. The catch is that all of this data is made public. It’s not attached to your name, but whether genetic information can ever be anonymous is somewhat debatable[1]. I applaud the people who join this project, and believe it could do a lot of good. But I won’t be signing up soon. Maybe on my deathbed.
  1. [1] By which I mean it almost certainly can’t, but that’s an unpleasant truth.