Classifying and visualizing with fastText and tSNE

Previously I wrote a three-part series on classifying text, in which I walked through the creation of a text classifier from the bottom up. It was interesting but it was purely an academic exercise. Here I’m going to use methods suitable for scaling up to large datasets, preferring tools written by others to those written by myself. The end goal is the same: classifying and visualizing relationships between blocks of text.
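As a rough sketch of what that pipeline can look like (not the exact code from the post; the training file name and label format here are assumptions), fastText's supervised mode handles the classification and t-SNE handles the 2-D layout:

    # Sketch: supervised fastText classifier + t-SNE projection of document vectors
    import numpy as np
    import fasttext                      # pip install fasttext
    from sklearn.manifold import TSNE

    # fastText expects one document per line, prefixed with "__label__<category>"
    model = fasttext.train_supervised("train.txt", epoch=5, wordNgrams=2)

    # Embed each document with the trained model
    docs = [line.split(" ", 1)[1].strip() for line in open("train.txt")]
    vecs = np.array([model.get_sentence_vector(d) for d in docs])

    # Project the document vectors to 2-D for plotting
    coords = TSNE(n_components=2, init="random").fit_transform(vecs)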

Classifying Text with Keras: Visualization

This is part 3 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

The output of our algorithm is a probability distribution based on the softmax function. We specified 14 different categories, so the output has 14 dimensions, each containing a number in [0, 1], and the values sum to 1. To “classify” a particular entry we simply take the category with the largest probability, but the output is richer than that.
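As a tiny illustration (assuming the trained model and test inputs from the earlier parts), turning the softmax distribution into a hard label is just an argmax, while the rest of the distribution is still available:

    import numpy as np

    # probs has shape (n_samples, 14): one softmax distribution per entry
    probs = model.predict(x_test)

    # "Classifying" means taking the most probable category...
    predicted_class = np.argmax(probs, axis=1)

    # ...but the full distribution also tells us how confident the model was
    confidence = np.max(probs, axis=1)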

Visualizing the output of a model can deliver a lot more insight than just “right” or “wrong”. We can see how different classes are related, what datapoints get classified incorrectly, and why that might be. We can see which classes are similar to each other, and which are completely different. In this post, we’ll look at a few different visualization methods.
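One such view, sketched here rather than taken verbatim from the post, is a row-normalized confusion matrix; it assumes the integer labels y_true and the softmax outputs probs are already in hand:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix

    y_pred = np.argmax(probs, axis=1)
    cm = confusion_matrix(y_true, y_pred)

    # Row-normalize so each row shows where a true class's examples end up
    cm_norm = cm / cm.sum(axis=1, keepdims=True)

    plt.imshow(cm_norm, cmap="Blues")
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.colorbar()
    plt.show()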

Full code described in this entry is available on GitHub, in dbpedia_classify/tree/part3.

Classifying Text with Keras: Logging

This is part 2 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

In part 1 we did the bare minimum to create a functional text classifier. In the real world, we’d like to have multiple indicators of performance. This post will demonstrate how to add more extensive measurement and logging to our model.
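For a flavor of what that can look like in Keras (a minimal sketch, not the post's exact configuration; the file and directory names are placeholders), extra metrics go into compile() and logging goes into callbacks:

    from keras.callbacks import CSVLogger, ModelCheckpoint, TensorBoard

    # Track accuracy in addition to the loss
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    callbacks = [
        CSVLogger("training.log"),                                     # per-epoch metrics to CSV
        ModelCheckpoint("model_{epoch:02d}.h5", save_best_only=True),  # keep the best model
        TensorBoard(log_dir="./logs"),                                 # browsable training curves
    ]

    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=10,
              callbacks=callbacks)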

Full code described in this entry is available on GitHub, in dbpedia_classify/tree/part2.

Classifying Text with Keras: Basic Text Processing

This is part 1 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

This post covers many of the nuts and bolts involved in creating a text classifier (experienced programmers may wish to skip ahead). We will use a relatively well-behaved text dataset which has been split into 14 categories. This post goes over the basics of tokenizing sentences into words, creating a vocabulary (or using a pre-built one), and finally constructing and training a neural network to classify text into each category.
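As a minimal sketch of those steps (not the architecture used in the post; texts and labels are assumed to be the raw documents and their one-hot category labels), the Keras tokenizer builds the vocabulary and a small softmax network does the classification:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Sequential
    from keras.layers import Embedding, GlobalAveragePooling1D, Dense

    MAX_WORDS, MAX_LEN, NUM_CLASSES = 20000, 100, 14

    # Tokenize the sentences and build a vocabulary of the most frequent words
    tokenizer = Tokenizer(num_words=MAX_WORDS)
    tokenizer.fit_on_texts(texts)
    x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

    # A small network: embed words, average them, classify with softmax
    model = Sequential([
        Embedding(MAX_WORDS, 64, input_length=MAX_LEN),
        GlobalAveragePooling1D(),
        Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x, labels, epochs=5, validation_split=0.1)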

Full code described in this entry is available on GitHub, in dbpedia_classify/tree/part1.

(Hyper) Rational Travel Planning

There’s an expression floating around the rationalist corner of the interwebs: “If you’ve never missed a flight, you’re spending too much time in airports.” The idea is that to be sure of making a flight, one needs to get to the airport really early (duh). All those times of being early add up, though, and are potentially more costly. I’m going to take a look at that, but first let’s start with a similar but simpler situation.

Continuous Travel Time

Consider meeting a friend for dinner. Being a good person, you don’t want to be late. Being a reasonable person, you also don’t want to waste your time by being early. You’re just meeting one person, and they’re a friend, so you value their time as much as you value your own. Thus the cost of being early should be the same as the cost of being late, meaning the loss should be symmetric. Also, small differences matter less than large ones: your friend won’t even notice if you’re 1 minute late, but they might get a bit annoyed if you’re 15 minutes late. The squared loss has both of these properties.

Loss(actual arrival time, desired arrival time) = (actual – desired)^2 / scale^2

To get really concrete, we need to have some probability distribution over travel times. For this case, I’ll use the Gamma(shape=3, scale=1) distribution, with a minimum of 15 minutes. Assume that we’re walking/biking/driving in low traffic so that there are no discontinuous jumps.
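A small sketch of the calculation (the loss scale here is an arbitrary choice; dividing by scale^2 rescales the loss but doesn’t move the optimum):

    import numpy as np
    from scipy import stats, optimize

    # Travel time: Gamma(shape=3, scale=1), shifted by the 15-minute minimum
    travel = stats.gamma(a=3, loc=15, scale=1)
    SCALE = 60.0  # arbitrary loss scale, in minutes

    def expected_loss(buffer_minutes):
        # E[(travel_time - buffer)^2 / scale^2], estimated by quadrature against the pdf
        t = np.linspace(15, 40, 2001)
        loss = (t - buffer_minutes) ** 2 / SCALE ** 2
        return np.trapz(loss * travel.pdf(t), t)

    best = optimize.minimize_scalar(expected_loss, bounds=(15, 40), method="bounded")
    print("Optimal buffer (min):", best.x)   # the mean travel time, ~18 minutes
    print("P(late):", travel.sf(best.x))     # ~0.42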

[Figure: dinner_pdf]
Mean of pdf: 18.00. Optimal buffer time: 18.000.
Min Expected Loss: 0.009. Stdev Loss: 5.90. Probability of being late: 0.4265

Nothing magical here. The expected loss is minimized when the buffer time equals the mean travel time. The expected loss is close to zero (0.01 on the right axis, whereas the maximum loss is scaled to 1.0). Given the assumed distribution of delay times, there’s about a 40% chance of being late. This is a pretty benign situation and the conclusion is unsurprising: calculate the average time to reach the destination, and leave that much time to get there. FIELDS MEDAL PLEASE!

Catching a Bus

A slightly more complicated situation, which many of us deal with, involves catching a bus. The nice thing about driving is that leaving 5 minutes later means arriving 5 minutes later[1]. However, if you miss your bus, it means waiting for the next one.

To model this, we’ll use the same mean-squared loss for being early, but being late (even by 1 second) will add a large additional delay. Say the bus runs every 20 minutes and you miss it by 1 second: instead of the small loss of (1 second)^2, the loss is now (1 second + 20 minutes)^2. For easier comparison I’ll use the same gamma distribution of travel times (with the same parameters) as before.
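The same numerical recipe works here, with the wait for the next bus added to any positive lateness (again a sketch, with an arbitrary loss scale):

    import numpy as np
    from scipy import stats, optimize

    travel = stats.gamma(a=3, loc=15, scale=1)  # same travel-time distribution as before
    WAIT = 20.0    # minutes until the next bus
    SCALE = 60.0   # arbitrary loss scale, in minutes

    def expected_loss(buffer_minutes):
        t = np.linspace(15, 45, 3001)
        late = t - buffer_minutes
        # early: plain squared loss; late by any amount: the wait for the next bus is added
        loss = np.where(late <= 0, late ** 2, (late + WAIT) ** 2) / SCALE ** 2
        return np.trapz(loss * travel.pdf(t), t)

    best = optimize.minimize_scalar(expected_loss, bounds=(15, 45), method="bounded")
    print("Optimal buffer (min):", best.x)           # noticeably more than the 18-minute mean
    print("P(missing the bus):", travel.sf(best.x))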

[Figure: bus_pdf]
Using a wait time of 20.00. Mean of pdf: 18.00. Optimal buffer time: 22.230.
Min Expected Loss: 0.041. Stdev Loss: 68.52. Probability of being late: 0.0251

So say the bus only comes every 20 minutes. We now leave an extra 4 minutes ahead of time (22 minutes instead of 18), the expected loss is roughly four times as high (0.041 vs. 0.009), and check out the standard deviation of the loss: over 10 times higher (68 vs. 5.9)! Notice that the probability of being late is only 2.5%; that means we’ve only allowed ourselves to miss the bus 1/40 of the time. For a daily commuter, that would be once every two months.

Catching a Plane

Now let’s analyze the case of catching a plane. Here I’ll use a normal distribution for the travel time to the airport, with an average door-to-gate time of 2 hours and a standard deviation of 15 minutes. We’ll use the same loss formula as for the bus, except with an additional wait time of 4 hours.
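Swapping in the normal travel-time distribution and the 4-hour wait gives the airport version of the same sketch (the loss scale is dropped since it doesn’t affect the optimum):

    import numpy as np
    from scipy import stats, optimize

    travel = stats.norm(loc=2.0, scale=0.25)  # door-to-gate time in hours: mean 2 h, sd 15 min
    WAIT = 4.0                                # hours added to the loss for missing the flight

    def expected_loss(buffer_hours):
        t = np.linspace(0.5, 3.5, 4001)
        late = t - buffer_hours
        loss = np.where(late <= 0, late ** 2, (late + WAIT) ** 2)
        return np.trapz(loss * travel.pdf(t), t)

    best = optimize.minimize_scalar(expected_loss, bounds=(1.0, 3.5), method="bounded")
    print("Optimal buffer (hours):", best.x)            # around 2.6 in the post's numbers
    print("P(missing the flight):", travel.sf(best.x))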

[Figure: plane_pdf]
Using a wait time of 4.00. Mean of pdf: 2.00. Optimal buffer time: 2.616.
Min Expected Loss: 0.019. Stdev Loss: 1.36. Probability of being late: 0.0070

Because of the enormous cost of missing one’s flight, we have to leave a lot of extra time. If the average door-to-gate time is 2 hours, we leave 2.6 hours in advance, and only miss the flight 0.7% of the time.

Based on this, if you’re acting rationally (and your loss functions are the ones above), you should miss about 1 out of every ~150 flights. For monthly travel, that would be one missed flight every ~13 years. So yeah, I guess if you’re middle-aged, a really frequent traveller, and have never missed a flight, maybe you’re spending too much time in airports, but I wouldn’t sweat it.

-Jacob

[1] Not always, particularly right at the start of rush hour, but usually.