Classifying Text with Keras: Basic Text Processing

This is part 1 of a three-part series describing text processing and classification. Part 1 covers input data preparation and neural network construction, part 2 adds a variety of quality metrics, and part 3 visualizes the results.

Contents

This post covers many of the nuts and bolts involved in creating a text classifier (experienced programmers may wish to skip ahead). We will use a relatively well-behaved text dataset which has been split into 14 categories. This post goes over the basics of tokenizing sentences into words, creating a vocabulary (or using a pre-built one), and finally, constructing and training a neural network to classify text into each category.

The full code described in this entry is available on GitHub, in dbpedia_classify/tree/part1.

Prerequisites

Keras (version 2, released March 14, 2017)

nltk, The Python Natural Language Toolkit

Tensorflow (version 0.12.1) is recommended, and required for later parts; for this part any Keras backend (e.g. Theano) should work

Gensim

Training Data

The dataset we’ll be using is The DBpedia ontology classification dataset. Based on DBPedia^[1], this dataset was assembled from 14 non-overlapping categories for “Character-level Convolutional Networks for Text Classification”^[2] and contains the class, title, and description of each item. The dataset is available (along with other data for that paper) here.

Reading in the data is done as follows:

import csv
import nltk

def desc_dict_generator(input_path, fieldnames=['class', 'title','description'], 
        text_field='description'):
    csv_reader = csv.DictReader(open(input_path, 'r'), fieldnames=fieldnames)
    for cur_dict in csv_reader:
        text = cur_dict[text_field].strip().lower()
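        # Drop non-ASCII characters. Don't love this but what can you do.
        # (Python 2 syntax; on Python 3 use text.encode("ascii", "ignore").decode("ascii"))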
        text = text.decode("ascii","ignore").encode("ascii")
        cur_dict['word_list'] = nltk.word_tokenize(text)
        yield cur_dict

The basic flow is simple: we iterate through the CSV file and extract the ‘description’ field. A few things are worth noticing:

Leading and trailing whitespace is removed, and all text is converted to lowercase.
Non-ASCII characters are removed. I hate doing this, but in complicated programs there is often some component somewhere which can’t handle non-ASCII characters. For simplicity I’m just ignoring them for now.
The text is tokenized using nltk.word_tokenize^[3]. Word tokenizing is a deceptively difficult problem; the simplest method is to just split on whitespace but that’s often inaccurate. For instance, the last sentence would read “problem;”, “that’s”, and “inaccurate.” as words when they should be separated into “problem” , “;” , “that” , “‘s”, “inaccurate” , “.” . The nltk word tokenizer handles those complexities more intelligently, although there is a serious performance hit. Tokenizing the text can often be a limiting factor of performance.
We use the yield keyword to return the result as a generator. More details on that in the next section.
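
To make the tokenization point concrete, here is a minimal sketch contrasting a plain whitespace split with a rough regex-based splitter. The regex is only a stand-in I chose so the example runs without NLTK's tokenizer data; it is not how nltk.word_tokenize works, and it treats contractions differently (it would yield "that", "'", "s" rather than "that", "'s"). The function names are hypothetical.

```python
import re

def whitespace_tokens(text):
    # Simplest approach: lowercase, then split on runs of whitespace
    return text.strip().lower().split()

def rough_tokens(text):
    # Hypothetical stand-in for a real tokenizer: runs of word
    # characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.strip().lower())

sentence = "Tokenizing is a difficult problem; splitting is often inaccurate."
print(whitespace_tokens(sentence))  # punctuation stays glued to the words
print(rough_tokens(sentence))       # punctuation becomes separate tokens
```

The whitespace version keeps "problem;" and "inaccurate." as single tokens, which would end up as separate vocabulary entries from "problem" and "inaccurate".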

Text Generators and Batch Construction

The DBPedia training dataset has 560,000 rows, and is about 160 MB uncompressed. Not the largest dataset out there but still more than we’d like to load into memory. So instead we read in each line as a generator using the yield keyword:

def create_desc_generator(input_path, word2id, indefinite=False, min_word_count=10):
    _finished = False
    while not _finished:
        dict_generator = desc_dict_generator(input_path)
        for cur_dict in dict_generator:
            word_list = cur_dict['word_list']
            int_word_list = [word2id[w] for w in word_list if w in word2id]
            if len(int_word_list) < min_word_count:
                continue
            cur_dict['int_word_list'] = int_word_list
            yield cur_dict
 
        _finished = not indefinite

Because training is performed on a batch of data (rather than just one entry at a time), we need a function to create a batch based on this generator:

def create_training_batch(generator, num_classes, max_input_length, 
   max_batch_size=64, return_raw_text=False):
    text_data = []
    X_ = []
    y_ = []
    for info_dict in generator:
        # Pick out the sequence of integers as the words
        seq_data = info_dict['int_word_list']
        # Change the class numbering to 0-based
        _class = int(info_dict['class']) - 1
 
        text_data.append(info_dict['word_list'])
        X_.append(seq_data)
        y_.append(_class)
 
        # Stop collecting once the batch is full
        if len(y_) >= max_batch_size:
            break
 
    # The sequences must all be the same length
    # So any which are shorter we pad up until the maximum input length
    X_train = sequence.pad_sequences(X_, maxlen=max_input_length)
    # Change from class number (0,1,2,etc.) to a one-hot matrix
    # (class 0 --> [1 0 0 ..], class 2 --> [0 0 1 ...])
    y_train = np_utils.to_categorical(y_, nb_classes=num_classes)
 
    if return_raw_text:
        return X_train, y_train, text_data
    else:
        return X_train, y_train
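
For readers who want to see what the padding and one-hot steps produce, here is a small pure-Python sketch. It mirrors what I understand to be the default behaviour of Keras's pad_sequences (zeros added at the front, over-long sequences cut from the front) and of to_categorical; pad_left and one_hot are hypothetical helper names, not part of Keras.

```python
def pad_left(seq, maxlen, value=0):
    # Keep at most the last `maxlen` items, then pad on the left
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

def one_hot(class_index, num_classes):
    # A row of zeros with a single 1 at the class position
    row = [0] * num_classes
    row[class_index] = 1
    return row

print(pad_left([5, 9, 2], maxlen=5))           # [0, 0, 5, 9, 2]
print(pad_left([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
print(one_hot(2, num_classes=4))               # [0, 0, 1, 0]
```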

The “return_raw_text” parameter is handy for debugging but not strictly necessary. Putting these two together, we create a generator for making a batch of data:

def create_batch_generator(input_path, word2id, num_classes,
 max_input_length, batch_size, return_raw_text=False):
    desc_generator = create_desc_generator(input_path, word2id, indefinite=True)
    while True:
        cur_batch = create_training_batch(desc_generator, num_classes, 
 max_input_length, batch_size, return_raw_text=return_raw_text)
        yield cur_batch

So now we have two generators:

create_desc_generator, which creates a generator parsing out one description at a time.
create_batch_generator, which generates a batch of training data based on the descriptions output from create_desc_generator.
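
The way the two generators chain together can be sketched with toy data. This is a simplified illustration rather than the code from the post: item_generator and batch_generator are hypothetical names, and itertools.islice stands in for the explicit loop used in create_training_batch.

```python
from itertools import islice

def item_generator(items, indefinite=False):
    # Yield one item at a time; start over when `indefinite` is set
    while True:
        for item in items:
            yield item
        if not indefinite:
            break

def batch_generator(source, batch_size):
    # Pull up to `batch_size` items at a time from the upstream generator
    while True:
        batch = list(islice(source, batch_size))
        if not batch:
            return
        yield batch

batches = batch_generator(item_generator(["a", "b", "c"], indefinite=True), 2)
print(next(batches))  # ['a', 'b']
print(next(batches))  # ['c', 'a']
```

Nothing is read until next() is called, so only one batch of items needs to be in memory at a time.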

And oh yeah, one more thing: the DBpedia data is in sorted class order. When training, we want a random mixture of classes in each batch, so the linear iteration proposed here obviously won’t work on the file as-is. The quickest fix is to use the “shuf” command (present on almost every modern *nix system, as well as Cygwin):

shuf train.csv > train_shuf.csv
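
If shuf is not available, a one-off shuffle can be sketched in Python. This is only an illustration under two assumptions of mine: every CSV record sits on exactly one line (no newlines inside quoted fields), and the file fits in memory for this single preprocessing step. shuffle_lines is a hypothetical helper name.

```python
import random

def shuffle_lines(input_path, output_path, seed=None):
    # Read every row into memory, dropping the trailing newline
    with open(input_path, "r") as src:
        lines = [line.rstrip("\n") for line in src]
    # A dedicated Random instance makes the shuffle reproducible via `seed`
    random.Random(seed).shuffle(lines)
    with open(output_path, "w") as dst:
        for line in lines:
            dst.write(line + "\n")
    return len(lines)
```

Calling shuffle_lines("train.csv", "train_shuf.csv") would write the shuffled rows and return the number of rows written.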

Defining our own Vocabulary (Optional)

A “vocabulary” is simply a finite set of words we consider valid, each assigned to an integer. So we can represent the sentence “I like turtles” with the sequence [2, 8, 124] (assuming that 2 = “I”, 8 = “like”, 124 = “turtles”)^[4]
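
As a rough illustration of the idea, a vocabulary can be built by hand with a word counter. This is a toy sketch, not what Gensim does internally; build_word2id is a hypothetical helper, and starting the ids at 1 is my own choice for the example.

```python
from collections import Counter

def build_word2id(token_lists, min_count=1, max_vocab_size=None):
    # Count every word across all tokenized documents
    counts = Counter(w for tokens in token_lists for w in tokens)
    # most_common() orders words by descending frequency
    kept = [w for w, c in counts.most_common(max_vocab_size) if c >= min_count]
    # Assumption for this sketch: ids start at 1
    return {w: i for i, w in enumerate(kept, start=1)}

docs = [["i", "like", "turtles"], ["i", "like", "tea"], ["i", "nap"]]
word2id = build_word2id(docs, min_count=2)
print(sorted(word2id.items()))  # [('i', 1), ('like', 2)]
# Words outside the vocabulary are simply dropped
print([word2id[w] for w in ["i", "like", "turtles"] if w in word2id])  # [1, 2]
```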

Using Gensim, this process is quite simple:

vocab_model = Word2Vec(size=embedding_size, max_vocab_size=max_vocab_size, 
min_count=min_word_count, workers=2, seed=2245)
 
print('{0}: Building own vocabulary'.format(datetime.datetime.now()))
desc_generator = basic_desc_generator(train_path)
vocab_model.build_vocab(desc_generator)
print('{0}: Saving vocabulary to {1}'.format(datetime.datetime.now(), vocab_path))
vocab_model.save(vocab_path)

Gensim’s model.build_vocab takes an iterable (list or generator) of lists of words, formatted as strings, and generates a vocabulary from that. We don’t have word vectors yet; those will be trained once the full model is built.

This step is “optional” because if we use pre-trained word vectors, the vocabulary is already defined. One could also imagine mixing and matching (defining our own vocabulary and pre-populating word vectors); that is left as an exercise for the reader.

Using Pre-Trained Vectors

Embedding vectors represent each word as a length-N vector of floating point numbers, and have been remarkably successful when used as features in natural language processing. Google has published a set of vectors trained on a Google News dataset^[5]. The binary file contains the words themselves, along with 300-dimensional vector representations. We can load the word vectors using Gensim very simply:

google_word2vec = '/path/to/GoogleNews-vectors-negative300.bin.gz'
vocab_model = Word2Vec.load_word2vec_format(google_word2vec, limit=top_words, binary=True)
# Extract the data structures we need
# Matrix of word vectors; `top_words` x 300
embedding_matrix = vocab_model.syn0
# Dictionary mapping from word --> row of embedding matrix
vocab_dict = {word: vocab_model.vocab[word].index for word in vocab_model.vocab.keys()}

Thank goodness for pre-trained word vectors. This gives a whole new meaning to the phrase “Google it”!

We’re almost there. We now have code to parse out words from raw text, a dictionary to convert those words into integers, and an embedding matrix to convert those integers into 300-dimensional word vectors.
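
The two lookups (word to integer, integer to vector) can be shown with a toy example. The three-word vocabulary and the 2-dimensional vectors below are made up for illustration; the real matrix would have 300 columns, and embed is a hypothetical helper.

```python
# Toy stand-ins: a 3-word vocabulary with 2-dimensional vectors
vocab_dict = {"i": 0, "like": 1, "turtles": 2}
embedding_matrix = [
    [0.1, 0.3],   # row 0: "i"
    [0.7, -0.2],  # row 1: "like"
    [-0.5, 0.9],  # row 2: "turtles"
]

def embed(words, vocab_dict, embedding_matrix):
    # Step 1: words --> integers (unknown words are skipped)
    ids = [vocab_dict[w] for w in words if w in vocab_dict]
    # Step 2: integers --> rows of the embedding matrix
    return ids, [embedding_matrix[i] for i in ids]

ids, vectors = embed(["i", "really", "like", "turtles"], vocab_dict, embedding_matrix)
print(ids)      # [0, 1, 2]
print(vectors)  # [[0.1, 0.3], [0.7, -0.2], [-0.5, 0.9]]
```

In the actual model the second step is handled by the embedding layer, so only the integer sequences are passed in as input.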

The Neural Network Model

The neural network itself is based on Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras, and the Keras example “IMDB CNN LSTM”. Both of those tutorials use the IMDB dataset, which has already been parsed into integers representing words. Converting free-form text into a nice clean integer-coded vocabulary is what this post is all about.

Using Keras, defining the model is ridiculously easy:

def build_lstm_model(top_words, embedding_size, max_input_length, num_outputs,
                    internal_lstm_size=100, embedding_matrix=None, embedding_trainable=True):
    """
    Parameters
    top_words : int
        Size of the vocabulary
    embedding_size : int
        Number of dimensions of the word embedding. e.g. 300 for Google word2vec
    embedding_matrix: None, or `top_words` x `embedding_size` matrix
        Initial/pre-trained embeddings
    embedding_trainable : bool
        Whether we should train the word embeddings. Must be true if no embedding matrix provided
    """
 
    if not embedding_trainable:
        assert embedding_matrix is not None, "Must provide an embedding matrix if not training one"
 
    _weights = None
    if embedding_matrix is not None:
        _weights = [embedding_matrix]
 
    model = Sequential()
    model.add(Embedding(top_words, embedding_size, input_length=max_input_length, weights=_weights, trainable=embedding_trainable))
    model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
    model.add(MaxPooling1D(pool_length=2))
    model.add(LSTM(internal_lstm_size))
    model.add(Dense(num_outputs, activation='softmax'))
    return model
 
model = build_lstm_model(max_vocab_size, embedding_size, max_input_length, num_classes,
        embedding_matrix=embedding_matrix, embedding_trainable=embedding_trainable)

An embedding layer to represent words as vectors, convolution and max-pooling to combine adjacent words, an LSTM to process words in the sentence, and finally a dense layer to classify the output into 1 of 14 classes. The whole thing takes seven lines of code. That’s shorter than most of the utility methods from above. Keras makes for some elegant code.
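
It can help to trace the shape of a single example through those layers. The sketch below is plain arithmetic rather than Keras code, and the input length of 500 is a hypothetical value chosen for illustration; layer_shapes is not a Keras function.

```python
def layer_shapes(max_input_length, embedding_size, num_outputs,
                 nb_filter=32, pool_length=2, internal_lstm_size=100):
    shapes = []
    # One vector of size `embedding_size` per word
    shapes.append(("Embedding", (max_input_length, embedding_size)))
    # border_mode='same' keeps the sequence length unchanged
    shapes.append(("Convolution1D", (max_input_length, nb_filter)))
    # Pooling over windows of 2 halves the sequence length (rounding down)
    shapes.append(("MaxPooling1D", (max_input_length // pool_length, nb_filter)))
    # With default settings the LSTM returns only its final output
    shapes.append(("LSTM", (internal_lstm_size,)))
    # One probability per class
    shapes.append(("Dense", (num_outputs,)))
    return shapes

for name, shape in layer_shapes(500, 300, 14):
    print(name, shape)
```

With these assumed sizes the sequence goes from (500, 300) to (500, 32) to (250, 32), then collapses to a single 100-element vector and finally to 14 class probabilities.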

Training

All we need to do is specify a loss quantity to minimize and an optimization algorithm, both of which can be specified via strings.

loss_ = 'categorical_crossentropy'
optimizer_ = 'adam'
# Specify metrics to save
log_metrics = ['categorical_accuracy', 'categorical_crossentropy']
# Save the model every epoch
model_saver = keras.callbacks.ModelCheckpoint(model_path,verbose=1)
_callbacks = [model_saver]
 
### Compile model
model.compile(loss=loss_, optimizer=optimizer_, metrics=log_metrics)
 
### Run training
train_generator = create_batch_generator(train_path, vocab_dict, num_classes, max_input_length, batch_size)
model.fit_generator(train_generator, samples_per_epoch, nb_epoch, callbacks=_callbacks, initial_epoch=initial_epoch)

And we’re done! I put a few more bells and whistles in the full code (available on GitHub), mostly so the model could resume runs which were previously interrupted. With a batch size = 100, batches per epoch = 10, and total epochs = 100 (so trained on 100,000 samples total) training on my desktop machine (4 core CPU, 2.6 GHz, 16 GB RAM) was as follows:

100,000 training samples

Training Time: 00:32:51. That is, 32 minutes, 51 seconds.
Categorical Accuracy: 0.9400
Categorical Crossentropy: 0.2084

On a 1,000-sample validation set:

Classification Time: 7.7 seconds
Categorical Accuracy: 0.918
Categorical Crossentropy: 0.2652

With 14 classes of approximately equal size, chance would have an accuracy of 0.0714. Clearly we’re doing better than that. Not bad!
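
The baseline numbers are easy to check by hand. The sketch below computes the chance accuracy, the cross-entropy of a model that assigns equal probability to every class (my own choice of reference point), and the total number of training samples; the categorical_crossentropy helper is a plain-Python illustration of the formula for one example, not the Keras implementation.

```python
import math

num_classes = 14
print(round(1.0 / num_classes, 4))             # 0.0714
print(round(-math.log(1.0 / num_classes), 4))  # 2.6391

# batch size x batches per epoch x epochs
print(100 * 10 * 100)  # 100000

def categorical_crossentropy(y_true, y_pred):
    # Only the true class contributes, since y_true is one-hot
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

uniform = [1.0 / num_classes] * num_classes
one_hot = [1] + [0] * (num_classes - 1)
print(round(categorical_crossentropy(one_hot, uniform), 4))  # 2.6391
```

Against that reference of roughly 2.64, the validation cross-entropy of 0.2652 is about ten times lower.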

What we’ve done here is write the barest minimum of a text classifier. The only quality metrics listed are training/runtime, classification accuracy, and cross-entropy loss. In the next post I’ll expand on quality metrics, so we can more precisely measure the performance of our model.

-Jacob

^[1]J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014
^[2]Zhang, Xiang, Junbo Zhao, and Yann LeCun. “Character-level convolutional networks for text classification.” Advances in neural information processing systems. 2015. http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification
^[3]According to the documentation the text should be sentence-tokenized first. I didn’t notice a difference in results when skipping sentence tokenization, possibly because the relevant text pieces are fairly short.
^[4]Customarily the integers are sorted in descending order of frequency, though that’s just a convention.
^[5]https://code.google.com/archive/p/word2vec/