Classifying Text with Keras: Visualization

This is part 3 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

The output of our algorithm is a probability distribution based on the softmax function. We specified 14 different categories, so the output has 14 dimensions, each a number in [0, 1], and the values sum to 1. To “classify” a particular entry we simply take the largest probability, but the full distribution is richer than that.
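As a quick illustration (this snippet is mine, not taken from the series’ code, and the probabilities are made up), picking a class from one softmax output row looks like this:

```python
import numpy as np

# One hypothetical softmax output row over the 14 categories; the values sum to 1.
probs = np.array([0.02, 0.61, 0.05, 0.01, 0.03, 0.04, 0.02,
                  0.05, 0.04, 0.03, 0.02, 0.03, 0.03, 0.02])

predicted_class = int(np.argmax(probs))  # the hard "classification"
confidence = probs[predicted_class]      # the other 13 probabilities still carry information
print(predicted_class, confidence)       # -> 1 0.61
```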

Visualizing the output of a model can deliver a lot more insight than just “right” or “wrong”. We can see how different classes are related, what datapoints get classified incorrectly, and why that might be. We can see which classes are similar to each other, and which are completely different. In this post, we’ll look at a few different visualization methods.

The full code described in this entry is available on GitHub, in dbpedia_classify/tree/part3.



Classifying Text with Keras: Logging

This is part 2 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

In part 1 we did the bare minimum to create a functional text classifier. In the real world, we’d like to have multiple indicators of performance. This post will demonstrate how to add more extensive measuring and logging to our model.

The full code described in this entry is available on GitHub, in dbpedia_classify/tree/part2.



Classifying Text with Keras: Basic Text Processing

This is part 1 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

This post covers many nuts and bolts involved in creating a text classifier (experienced programmers may wish to skip ahead). We will use a relatively well-behaved text dataset which has been split into 14 categories. This post goes over the basics of tokenizing sentences into words, creating a vocabulary (or using a pre-built one), and finally, constructing and training a neural network to classify text into each category.

The full code described in this entry is available on GitHub, in dbpedia_classify/tree/part1.



(Hyper) Rational Travel Planning

There’s an expression floating around the rationalist corner of the interwebs: “If you’ve never missed a flight, you’re spending too much time in airports”. The idea is that to be sure of making a flight, one needs to get to the airport really early (duh). All those times of being early add up, though, and are potentially more costly. I’m going to take a look at that, but first let’s start with a similar but simpler situation.

Continuous Travel Time

Consider meeting a friend for dinner. Being a good person you don’t want to be late. Being a reasonable person you also don’t want to waste your time and be early. You’re just meeting one person, and they’re a friend, so you value their time as much as you value your own. Thus the cost of being early should be the same as the cost of being late, meaning it should be symmetric. Also, small differences matter less than large ones. Your friend won’t even notice if you’re 1 minute late, but they might get a bit annoyed if you’re 15 minutes late. The squared loss has both of these properties.

Loss(actual arrival time, desired arrival time) = (actual - desired)^2 / scale^2

To get really concrete, we need to have some probability distribution over travel times. For this case, I’ll use the Gamma(shape=3, scale=1) distribution, with a minimum of 15 minutes. Assume that we’re walking/biking/driving in low traffic so that there are no discontinuous jumps.
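As a rough sketch of how numbers like the ones below can be reproduced (this is not the original script; the loss normalization scale=15.0 and the Monte-Carlo approach are my own choices), here’s one way to find the optimal buffer time numerically:

```python
import numpy as np
from scipy import stats, optimize

# Travel time: 15 minutes minimum plus a Gamma(shape=3, scale=1) delay, sampled by Monte Carlo.
travel = 15 + stats.gamma(a=3, scale=1).rvs(size=200_000, random_state=np.random.default_rng(0))

def expected_sq_loss(buffer_minutes, scale=15.0):
    offset = travel - buffer_minutes        # positive = late, negative = early
    return np.mean((offset / scale) ** 2)   # symmetric squared loss

res = optimize.minimize_scalar(expected_sq_loss, bounds=(15, 40), method="bounded")
print(f"mean travel time: {travel.mean():.2f} min, optimal buffer: {res.x:.2f} min")
print(f"probability of being late: {np.mean(travel > res.x):.3f}")
```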

[Figure: dinner case, travel-time pdf and expected loss]
Mean of pdf: 18.00. Optimal buffer time: 18.000.
Min expected loss: 0.009. Stdev of loss: 5.90. Probability of being late: 0.4265.

Nothing magical here. The expected loss is minimized when the buffer time equals the mean travel time. The expected loss is close to zero (0.01 on the right axis, whereas the maximum loss is scaled to 1.0). Given the assumed distribution of travel times, there’s about a 40% chance of being late. This is a pretty benign situation and the conclusion is unsurprising: calculate the average time to reach the destination, and leave that much time to get there. FIELDS MEDAL PLEASE!

Catching a Bus

A slightly more complicated situation, which many of us deal with, involves catching a bus. The nice thing about driving is that leaving 5 minutes later means arriving 5 minutes later[1]. However, if you miss your bus, it means waiting for the next one.

To model this, we’ll use the same mean-squared loss for being early, but being late (even by 1 second) will add a large additional delay. Say the bus runs every 20 minutes and you miss it by 1 second: instead of the small loss of (1 second)^2, the loss is now (1 second + 20 minutes)^2. For easier comparison I’ll use the same gamma distribution (with the same parameters) for the travel time.
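Continuing the same assumptions as the sketch above (gamma travel time, my own loss scale; none of this is the original script), the bus version of the loss just adds the wait for the next bus whenever we’re late:

```python
import numpy as np
from scipy import stats, optimize

travel = 15 + stats.gamma(a=3, scale=1).rvs(size=200_000, random_state=np.random.default_rng(0))

def expected_bus_loss(buffer_minutes, wait=20.0, scale=15.0):
    offset = travel - buffer_minutes                       # positive means the bus was missed
    penalty = np.where(offset > 0, offset + wait, offset)  # missing it adds the full wait for the next bus
    return np.mean((penalty / scale) ** 2)

res = optimize.minimize_scalar(expected_bus_loss, bounds=(15, 60), method="bounded")
print(f"optimal buffer with a 20-minute wait: {res.x:.2f} min")
print(f"probability of missing the bus: {np.mean(travel > res.x):.4f}")
```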

[Figure: bus case, travel-time pdf and expected loss]
Using a wait time of 20.00. Mean of pdf: 18.00. Optimal buffer time: 22.230.
Min expected loss: 0.041. Stdev of loss: 68.52. Probability of being late: 0.0251.

So say the bus only comes every 20 minutes. We now leave an extra 4 minutes ahead of time (22 minutes instead of 18 minutes), the expected loss is twice as high, and check out the standard deviation of the loss: over 10 times higher (68 vs 5.9)! Notice the probability of being late is only 2.5%; that means we’ve only allowed ourselves to miss the bus 1/40 of the time. For a daily commuter, that would be once every two months.

Catching a Plane

Now let’s analyze the case of catching a plane. Here I’ll use a normal distribution for travel to the airport: average door-to-gate time of 2 hours, standard deviation of 15 minutes. We’ll use the same loss formula as for the bus, except with an additional wait time of 4 hours.

[Figure: plane case, travel-time pdf and expected loss]
Using a wait time of 4.00. Mean of pdf: 2.00. Optimal buffer time: 2.616.
Min expected loss: 0.019. Stdev of loss: 1.36. Probability of being late: 0.0070.

Because of the enormous cost of missing one’s flight, we have to leave a lot of extra time. If the average door-to-gate time is 2 hours, we leave 2.6 hours in advance, and only miss the flight 0.7% of the time.

Based on this, if you’re acting rationally (and your loss functions are the ones above), you should miss 1 out of every ~150 flights. So for monthly travel, that would be one flight every ~13 years. So yeah, I guess if you’re middle-aged, a really frequent traveller, and have never missed a flight, maybe you’re spending too much time in airports, but I wouldn’t sweat it.

-Jacob

[1] Not always, particularly right at the start of rush hour, but usually.

Estimating active reddit users

I’m always curious about how much activity subreddits have, and how representative the comments are of the userbase. It’s well known that the majority of people are lurkers, who just view content but don’t vote or comment. Some subset of those people will actually create accounts and vote on stuff, and some subset of those people will comment. We can count total users (lurkers + voters + commenters) via traffic statistics, and we can count commenters just by analyzing comments.

But how do we count voters? That is, people with accounts who vote but don’t comment. That data isn’t available; all we have to go on are the comments and their vote counts. Statistics to the rescue!

Chao’s Estimator

Here I borrow a technique from ecology used to estimate species richness. Say Ecol O. Gist wants to know how many species of plant exist in a swamp. Well, they can go through and count how many they find, but the only way to guarantee an accurate count is an exhaustive search of every single plant, which is impossible.

So instead Dr. Gist comes up with a plan. Go to the swamp, grab a sample of plants, note which species they see. Then come back the next day and do it again. Then come back the next day and do it again. Rinse and repeat a bunch of times. If they continue to see mostly new species each day, that means they haven’t come close to getting a complete count. Once they start to see mostly duplicates, it’s a good bet they’ve seen all the species.

The above hand-wavey explanation is the intuition behind Chao’s estimator of population size [1]. This estimator is very general: it does not assume equal “catchability” (i.e. probability of being sampled) across all things being sampled. Which is good, because it’s much easier to find a tree in one’s sample than a particular type of moss. Applied to our case of interest, it’s easier to sample a highly active (~1000 comments) user than a rarely active (~1 comment) one.

This estimator does assume a consistent probability per class, meaning that a tree is just as likely to be sampled one day as the next, or that a user who comments on 10% of days has a 10% chance of commenting on each day independently of the others. This is definitely not true for reddit (weekdays 9-5 are the most active times). It also assumes a fixed population size, which is also not true for reddit (subreddits grow and shrink). Both of these are limitations which the reader should keep in mind.

The relevant formulas that I used are:

\begin{align*} \hat{N} &= S + \frac{f_1 (f_1 - 1)}{2 (f_2 + 1)} \approx S + \frac{f_1^2}{2 f_2} \\ \hat{\sigma}^2 &= f_2 \left(\frac{f_1}{f_2 + 1}\right)^2 \left[ \frac{1}{4} \left(\frac{f_1}{f_2 + 1}\right)^2 + \frac{f_1}{f_2 + 1} + \frac{1}{2} \right] \approx \frac{f_1^2}{f_2} \left[ \frac{1}{4} \left(\frac{f_1}{f_2}\right)^2 + \frac{f_1}{f_2} + \frac{1}{2} \right] \end{align*}

where f_1 is the number of users observed exactly once, f_2 the number observed exactly twice, S is the total number of users observed, \hat{N} is the estimated number of total users (observed + unobserved), and \hat{\sigma}^2 is the estimated variance of \hat{N}. I used the bias-corrected formulas (the ones with f_2 + 1) for the calculations, as they are more accurate and handle f_2 = 0; the formulas without that correction are much more readable and are shown as approximations.
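A minimal sketch of these formulas in Python (mine, not the code used for the analysis; the input format, a list of per-user observation counts, is hypothetical):

```python
from collections import Counter
import math

def chao_estimate(times_observed):
    """times_observed: for each observed user, how many sampling periods they were seen in."""
    freq = Counter(times_observed)
    S = sum(freq.values())        # total distinct users observed
    f1 = freq.get(1, 0)           # users observed exactly once
    f2 = freq.get(2, 0)           # users observed exactly twice
    n_hat = S + f1 * (f1 - 1) / (2 * (f2 + 1))       # bias-corrected point estimate
    r = f1 / (f2 + 1)
    var_hat = f2 * r**2 * (0.25 * r**2 + r + 0.5)    # bias-corrected variance
    return n_hat, math.sqrt(var_hat)

# Toy example: 800 users seen once, 150 seen twice, 50 seen three times.
n_hat, sd = chao_estimate([1] * 800 + [2] * 150 + [3] * 50)
print(f"observed 1000 users, estimated total {n_hat:.0f} +/- {sd:.0f}")
```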

Methods

The dataset was the same one I’ve used previously, all reddit comments from January 2015, separated by subreddit. I required 1000 unique commenters for a subreddit to be included, which left 1322 subreddits. A “sampling period” was considered to be one calendar day. So f_1 is the number of users who commented on exactly one day, regardless of how many times they commented on that day.

A natural method to check for accuracy is to include more and more data and check for convergence. A major drawback to this approach is that it assumes fixed population size (ie no growth in the size of the subreddit); balancing these concerns is part of the reason I stuck to one month of data. 31 days is a decent sample size, and most subreddits won’t grow too much during that time. Still, to counteract any time-related trends I shuffled the days being included, so an increase in users would just look like a fluctuation.
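Here’s a sketch of that procedure (again mine; comments_by_day is a hypothetical mapping from calendar day to the set of users who commented that day in a given subreddit):

```python
import random
from collections import Counter

def estimates_vs_days_included(comments_by_day, seed=0):
    """Add one randomly ordered day at a time and recompute the Chao estimate."""
    days = list(comments_by_day)
    random.Random(seed).shuffle(days)      # shuffle so growth looks like fluctuation
    days_seen = Counter()                  # user -> number of distinct days seen so far
    estimates = []
    for day in days:
        for user in set(comments_by_day[day]):
            days_seen[user] += 1
        freq = Counter(days_seen.values())
        S, f1, f2 = sum(freq.values()), freq.get(1, 0), freq.get(2, 0)
        estimates.append(S + f1 * (f1 - 1) / (2 * (f2 + 1)))
    return estimates
```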

Results

The results for two subreddits are shown below.

[Figure: estimated total users vs. number of days included, for /r/programming and /r/cars]

These subreddits are roughly similar in size, which is why I picked them. Notice how in the last 5 days the /r/programming curve is fairly flat, while the /r/cars curve continues to climb.

There needs to be some way to determine whether something has “converged”. The metric I used was comparing the average change over the last 5 days to the standard deviation[2]. The motivation here is that of course the estimate will fluctuate due to chance, and the standard deviation gives a measure of that. In the case of /r/programming, the estimate changed by only 255 users, while the standard deviation was 430. Meanwhile for /r/cars, the estimate changed by 1500 users, with a standard deviation of 329.
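A sketch of that check, based on footnote [2] (the details, e.g. taking the change as the regression slope times the window length, are my guesses):

```python
import numpy as np
from scipy import stats

def has_converged(estimates, window=5):
    """Compare the slope-implied change over the last `window` estimates to their standard deviation."""
    tail = np.asarray(estimates[-window:], dtype=float)
    fit = stats.linregress(np.arange(len(tail)), tail)
    change = abs(fit.slope) * (len(tail) - 1)   # change implied by the fitted slope across the window
    return change < tail.std(ddof=1)
```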

Of the 1322 subreddits, 446 “converged”. A scatterplot of the estimated number of accounts vs directly observed commenters is shown below. The red dots show subreddits which did not converge (the estimate using all the data is shown), the blue triangles show subreddits which did converge. Error bars are not shown because they are smaller than the points being plotted on a log-log scale.

[Figure: estimated total users vs. directly observed commenters, log-log scatterplot]

Immediately a trend is apparent: this method only converges for small subreddits. There is also a good linear fit of estimated total vs observed commenters which applies to both the converged and non-converged subreddits. The pessimistic view is that this means even the converged estimates are invalid; the optimistic view is that even the non-converged estimates may be valid. It could also be that the true trend is non-linear, with the linear fit as an approximation, making the converged estimates valid and the non-converged estimates invalid.

A histogram of the estimated/observed ratios is shown below:

[Figure: histogram of the ratio of estimated to observed users]

So we estimate there are generally about 2x as many potential commenters (i.e. voters) as there are actual commenters. This is lower than I expected based on the 90/9/1 rule, which would put it at around 9x. Then again, the quantity I’m attempting to measure here is a bit odd: it’s people with accounts who could participate, but don’t. Whereas the “9” in that rule refers to people who sparsely participate, which can include commenting as well as voting.

It’s worth noting the outliers, /r/nameaserver and /r/millionairemakers [3]. /r/nameaserver is open by invitation to people who buy reddit gold at the right time; they get together and vote to decide on a server name, and have 24 hours to decide. So most people will only comment on one day, leading to an enormous overestimate.

/r/millionairemakers is a subreddit which holds a drawing once a month: a winner is picked and people are called on to donate. The way you enter the drawing is by commenting on the relevant thread, so basically there are going to be a lot of users who comment exactly once on one day.

Both of these subreddits violate the consistent-probability assumption of Chao’s estimator, so it’s no surprise the results are odd. The extreme-skepticism view would be to say that this invalidates all the results above. I’m a bit more optimistic; since there are obvious outliers with an obvious explanation for why they are outliers, it may be that the method works well for subreddits which don’t have pathological commenting patterns.

It’s analyses like this that make me want to work at reddit, or at least have access to their whole dataset. In most cases Chao’s estimator will provide an underestimate, so the 2:1 ratio is likely too low. Without tracking users’ pageviews and votes, though, there is no way to know for sure. It would be really amazing if this statistical magic actually provided an accurate answer.

-Jacob

[1] Chao, A. 1987. Estimating the Population Size for Capture-Recapture Data with Unequal Catchability. Biometrics 43: 783-791.
[2] Specifically, I performed a linear regression on the last 5 days and calculated the change based on the slope.
[3] These also had the lowest Gini coefficients in my last analysis.