Introduction
Question answering has received more attention as large search engines have largely mastered general information retrieval and are starting to cover more edge cases. Question answering is one of those edge cases, because it can involve a lot of syntactic nuance that isn’t captured by standard information retrieval models such as LDA or LSI. In principle, deep learning models should be better suited to this type of task because of their ability to capture higher-order syntax. Two papers, “Applying deep learning to answer selection: a study and an open task” (Feng et al. 2015) and “LSTM-based deep learning models for non-factoid answer selection” (Tan et al. 2016), are recent examples that apply deep learning to question-answering tasks with good results.
Feng et al. used an in-house Java framework for their work, and Tan et al. built their model entirely in Theano. Personally, I am a lot lazier than them, and I don’t understand CNNs very well, so I would like to use an existing framework to build one of their models and see if I can get similar results. Keras is a really popular one that has support for everything we might need to put the model together.
The GitHub repository for this project can be found here.
Installing Keras
See the instructions here on how to install Keras. The simple route is to install using pip, e.g.
sudo pip install --upgrade keras
There are some important features that might not be available without the most recent version. I’m not sure if doing pip install gets the most recent version, so it might be helpful to install from source. This is actually pretty straightforward! Just change to the directory where you want your source code to be and do:
git clone https://github.com/fchollet/keras.git
cd keras
sudo python setup.py install
One benefit of this is that if you want to add a custom layer, you can add it to the Keras installation and use it across different projects. Even better, you could fork the project and clone your own fork, although this gets into areas of Git beyond my understanding.
Jumping into language modeling
There are actually a couple of language models in the Keras examples:

imdb_lstm.py: Using an LSTM recurrent neural network to do sentiment analysis on the IMDB dataset
imdb_cnn_lstm.py: The same task, but this time using a CNN layer beneath the LSTM layer
babi_rnn.py: Recurrent neural networks for modeling Facebook’s bAbi dataset, “a mixture of 20 tasks for testing text understanding and reasoning”

These are pretty interesting to play around with. It is really cool how easy it is to get one of these set up! With Keras, a high-level model design can be quickly implemented.
Word Embeddings
Ok! Let’s dive in. The first challenge that you might think of when designing a language model is what the units of the language might be. A reasonable dataset might have around 20000 distinct words after lemmatizing them. If the average sentence is 40 words long and each word is one-hot encoded, then you’re left with a 20000 x 40 matrix just to represent one sentence, which is 3.2 megabytes if each entry is stored in 32 bits. This obviously doesn’t scale, so the first step in developing a good language model is to figure out how to reduce the number of dimensions required to represent a word.
One popular method of doing this is word2vec. word2vec is a way of embedding words in a vector space so that words that are semantically similar are near each other. There are some interesting consequences of doing this, like being able to do word addition and subtraction:
king - man + woman = queen
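As a toy illustration of that arithmetic, here is a minimal sketch using made-up 2-dimensional vectors. The values are assumptions picked by hand so that one axis loosely means "royalty" and the other "gender"; they are not real word2vec output, and the helper names are hypothetical.

```python
import math

# Hand-picked toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These values are illustrative assumptions, not trained embeddings.
vectors = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy('king', 'man', 'woman')
print(result)
```

With real embeddings the match is only approximate, which is why the nearest neighbour is looked up by similarity rather than by exact equality.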
In Keras, a trainable embedding like this is available as an Embedding layer. This layer takes as input a (n_batches, sentence_length) matrix of integers, one index per word in each sentence, and outputs a (n_batches, sentence_length, n_embedding_dims) tensor, where the last dimension is the word embedding.
There are two advantages to this. The first is space: instead of 3.2 megabytes, a 40-word sentence embedded in 100 dimensions takes only 16 kilobytes, which is much more reasonable. More importantly, word embeddings give the model a hint at the meaning of each word, so it will converge more quickly. There are significantly fewer parameters to jostle around, and the parameters are tied together in a sensible way so that they jostle in the right direction.
Here’s how you would go about writing something like this:
from keras.layers import Input, Embedding

input_sentence = Input(shape=(sentence_maxlen,), dtype='int32')
embedding = Embedding(n_words, n_embed_dims)(input_sentence)
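As a quick sanity check on the storage figures quoted earlier, the arithmetic can be written out directly. This is only a back-of-the-envelope sketch; the function names are hypothetical, and it assumes 4 bytes (32 bits) per stored value.

```python
BYTES_PER_VALUE = 4  # assumption: every entry is stored in 32 bits

def one_hot_bytes(vocab_size, sentence_length):
    """Size of a (vocab_size x sentence_length) one-hot matrix."""
    return vocab_size * sentence_length * BYTES_PER_VALUE

def embedded_bytes(n_embed_dims, sentence_length):
    """Size of the same sentence after embedding each word."""
    return n_embed_dims * sentence_length * BYTES_PER_VALUE

one_hot = one_hot_bytes(20000, 40)
embedded = embedded_bytes(100, 40)
print(one_hot / 1e6, 'megabytes')
print(embedded / 1e3, 'kilobytes')
```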
Let’s try this out! We can train a recurrent neural network to predict some dummy data and examine the embedding layer for each vector. This model takes a sentence like “sam is red” or “sarah not green” and predicts what color the person is. It is a very simple example, but it will illustrate what the Embedding layer is doing, and also illustrate how we can turn a bunch of sentences into vectors of indices by building a dictionary.
import itertools
import numpy as np

sentences = '''
sam is red
hannah not red
hannah is green
bob is green
bob not red
sam not green
sarah is red
sarah not green'''.strip().split('\n')
is_green = np.asarray([[0, 1, 1, 1, 1, 0, 0, 0]], dtype='int32').T

lemma = lambda x: x.strip().lower().split(' ')
sentences_lemmatized = [lemma(sentence) for sentence in sentences]
words = set(itertools.chain(*sentences_lemmatized))
# set(['sam', 'is', 'red', 'hannah', 'not', 'green', 'bob', 'sarah'])

# dictionaries for converting words to integers and vice versa
word2idx = dict((v, i) for i, v in enumerate(words))
idx2word = list(words)

# convert the sentences to a numpy array
to_idx = lambda x: [word2idx[word] for word in x]
sentences_idx = [to_idx(sentence) for sentence in sentences_lemmatized]
sentences_array = np.asarray(sentences_idx, dtype='int32')

# parameters for the model
sentence_maxlen = 3
n_words = len(words)
n_embed_dims = 3

# put together a model to predict
from keras.layers import Input, Embedding, SimpleRNN
from keras.models import Model

input_sentence = Input(shape=(sentence_maxlen,), dtype='int32')
input_embedding = Embedding(n_words, n_embed_dims)(input_sentence)
color_prediction = SimpleRNN(1)(input_embedding)

predict_green = Model(input=[input_sentence], output=[color_prediction])
predict_green.compile(optimizer='sgd', loss='binary_crossentropy')

# fit the model to predict what color each person is
predict_green.fit([sentences_array], [is_green], nb_epoch=5000, verbose=1)
embeddings = predict_green.layers[1].W.get_value()

# print out the embedding vector associated with each word
for i in range(n_words):
    print('{}: {}'.format(idx2word[i], embeddings[i]))
The embedding layer embeds the words into 3 dimensions. A sample of the vectors it produces is seen below. As predicted, the model learns useful word embeddings.
sarah: [0.5835458 0.2772688 0.01127077]
sam: [0.57449967 0.26132962 0.04002968]
bob: [ 1.10480607 0.97720605 0.10953052]
hannah: [ 1.12466967 0.95199704 0.13520472]
not: [0.17611612 0.2958962 0.06028322]
is: [0.10752882 0.34842652 0.06909169]
red: [0.10381682 0.31055665 0.0975003 ]
green: [0.05930901 0.33241618 0.06948926]
Each category is grouped in the 3-dimensional vector space. The network learned each of these categories from how each word was used; Sarah and Sam are the red people, while Bob and Hannah are the green people. However, it did not differentiate well between not, is, red, and green, because those weren’t immediately informative for the decision task.
Recurrent Neural Networks
As the Keras examples illustrate, there are different philosophies on deep language modeling. Feng et al. did a bunch of benchmarks with convolutional networks, and ended up with some impressive results. Tan et al. used recurrent networks with some different parameters. I’ll focus on recurrent neural networks first (What do pirates call neural networks? Arrrgh NNs). I’ll assume some familiarity with both recurrent and convolutional neural networks. Andrej Karpathy’s blog discusses recurrent neural networks in detail. Here is an image from that post which explains the core concept:
Vanilla
The basic RNN architecture is essentially a feed-forward neural network that is stretched out over a bunch of time steps and has its intermediate output added to the next input step. This idea can be expressed as an update equation for each input step:
new_hidden_state = tanh(dot(input_vector, W) + dot(prev_hidden, U) + b)
Note that dot indicates vector-matrix multiplication. Multiplying a vector of dimensions <m> by a matrix of dimensions <m, n> can be done with dot(<m>, <m, n>) and yields a vector of dimensions <n>. This is consistent with its usage in Theano and Keras. In the update equation, we multiply each input_vector by our input weights W, multiply the prev_hidden vector by our hidden weights U, and add a bias b, before passing the sum to the activation function tanh. To get the many-to-one behavior in the image, we can grab the last hidden state and use that as our output. To get the one-to-many behavior, we can pass one input vector and then pass a bunch of zero vectors to get as many hidden states as we want.
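To make the update equation concrete, here is a minimal pure-Python sketch of a vanilla RNN run in many-to-one mode. It is not how Keras implements SimpleRNN; the helper names, the tiny dimensions and the random weights are all assumptions made for illustration.

```python
import math
import random

def dot(vec, mat):
    """Multiply a vector of size m by an m x n matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def rnn_step(input_vector, prev_hidden, W, U, b):
    """One update: tanh(dot(input_vector, W) + dot(prev_hidden, U) + b)."""
    pre = [x + h + bias
           for x, h, bias in zip(dot(input_vector, W), dot(prev_hidden, U), b)]
    return [math.tanh(p) for p in pre]

def run_many_to_one(sequence, W, U, b):
    """Feed the whole sequence through the RNN and return the last hidden state."""
    hidden = [0.0] * len(b)
    for input_vector in sequence:
        hidden = rnn_step(input_vector, hidden, W, U, b)
    return hidden

# Hypothetical sizes and random weights, just to run the loop once.
rng = random.Random(0)
input_dims, hidden_dims = 3, 2
rand_matrix = lambda rows, cols: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                                  for _ in range(rows)]
W = rand_matrix(input_dims, hidden_dims)
U = rand_matrix(hidden_dims, hidden_dims)
b = [0.0] * hidden_dims

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
final_hidden = run_many_to_one(sequence, W, U, b)
print(final_hidden)
```

Returning every intermediate hidden state instead of only the last one would give the many-to-many variant.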
LSTM
If the RNN gets really long, then we run into a lot of difficulty training the model. The effect of something early in the sequence on the end result is very small relative to later components, so it is hard to use that information in updating the weights. To solve this, several methods have been proposed, and two have been implemented in Keras. The first is the Long Short-Term Memory (LSTM) unit, which was proposed by Hochreiter and Schmidhuber 1997. This model uses a second hidden state which stores information from further back in the sequence, allowing that information to have a stronger effect on the end result. The update equations for this model are:
input_gate = sigmoid(dot(input_vector, W_input) + dot(prev_hidden, U_input) + b_input)
forget_gate = sigmoid(dot(input_vector, W_forget) + dot(prev_hidden, U_forget) + b_forget)
output_gate = sigmoid(dot(input_vector, W_output) + dot(prev_hidden, U_output) + b_output)
candidate_state = tanh(dot(input_vector, W_hidden) + dot(prev_hidden, U_hidden) + b_hidden)
memory_unit = prev_memory_unit * forget_gate + candidate_state * input_gate
new_hidden_state = tanh(memory_unit) * output_gate
Note that * indicates elementwise multiplication, and sigmoid squashes each gate into the range 0 to 1. This is consistent with its usage in Theano and Keras. First, there are a bunch more parameters to train: each of the three gates and the candidate state has its own input-to-hidden and hidden-to-hidden matrices. Second, we have an accompanying memory_unit. The memory unit is like a second hidden state that transfers information to and from the hidden state. It is like a safety deposit box for putting things in and taking things out.
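The LSTM update can be sketched in plain Python along the same lines. This is an illustrative sketch rather than the Keras implementation: the gates use the logistic sigmoid, as in the standard LSTM formulation, and the parameter layout (a dict of (W, U, b) tuples) and the demo values are assumptions.

```python
import math

def dot(vec, mat):
    """Multiply a vector of size m by an m x n matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(activation, x, h, W, U, b):
    """activation(dot(x, W) + dot(h, U) + b), applied elementwise."""
    pre = [a + c + d for a, c, d in zip(dot(x, W), dot(h, U), b)]
    return [activation(p) for p in pre]

def lstm_step(x, prev_hidden, prev_memory, params):
    """params maps 'input', 'forget', 'output', 'hidden' to (W, U, b) tuples."""
    i = gate(sigmoid, x, prev_hidden, *params['input'])
    f = gate(sigmoid, x, prev_hidden, *params['forget'])
    o = gate(sigmoid, x, prev_hidden, *params['output'])
    candidate = gate(math.tanh, x, prev_hidden, *params['hidden'])
    # keep part of the old memory, add part of the new candidate
    memory = [m * fg + c * ig for m, fg, c, ig in zip(prev_memory, f, candidate, i)]
    hidden = [math.tanh(m) * og for m, og in zip(memory, o)]
    return hidden, memory

# Tiny demo with one input dimension and one hidden unit (hypothetical values).
zeros = [[0.0]]
params = {
    'input':  (zeros, zeros, [-100.0]),  # input gate ~0: ignore the new candidate
    'forget': (zeros, zeros, [100.0]),   # forget gate ~1: keep the old memory
    'output': (zeros, zeros, [0.0]),     # output gate = 0.5
    'hidden': (zeros, zeros, [1.0]),
}
hidden, memory = lstm_step([1.0], [0.0], [0.5], params)
print(hidden, memory)
```

With the forget gate saturated open and the input gate saturated shut, the memory passes through the step unchanged, which is the mechanism that lets information survive over long sequences.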
GRU
The second model is the Gated Recurrent Unit (GRU), which was proposed by Cho et al. 2014. The equations for this model are as follows:
update_gate = sigmoid(dot(input_vector, W_update) + dot(prev_hidden, U_update) + b_update)
reset_gate = sigmoid(dot(input_vector, W_reset) + dot(prev_hidden, U_reset) + b_reset)
reset_hidden = prev_hidden * reset_gate
temp_state = tanh(dot(input_vector, W_hidden) + dot(reset_hidden, U_hidden) + b_hidden)
new_hidden_state = (1 - update_gate) * temp_state + update_gate * prev_hidden
In this model, there is an update_gate which controls how much of the previous hidden state to carry over to the new hidden state and a reset_gate which controls how much of the previous hidden state feeds into the candidate temp_state. This allows potentially long-term dependencies to be propagated through the network.
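A matching pure-Python sketch of one GRU step is shown below. As before, this is an illustration under assumptions (sigmoid gates as in the standard GRU, a hypothetical dict-of-tuples parameter layout, hand-picked demo values), not the Keras code.

```python
import math

def dot(vec, mat):
    """Multiply a vector of size m by an m x n matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(activation, x, h, W, U, b):
    """activation(dot(x, W) + dot(h, U) + b), applied elementwise."""
    pre = [a + c + d for a, c, d in zip(dot(x, W), dot(h, U), b)]
    return [activation(p) for p in pre]

def gru_step(x, prev_hidden, params):
    """params maps 'update', 'reset', 'hidden' to (W, U, b) tuples."""
    z = gate(sigmoid, x, prev_hidden, *params['update'])
    r = gate(sigmoid, x, prev_hidden, *params['reset'])
    reset_hidden = [h * rg for h, rg in zip(prev_hidden, r)]
    temp = gate(math.tanh, x, reset_hidden, *params['hidden'])
    # update gate near 1 keeps the old state, near 0 takes the new candidate
    return [(1 - zg) * t + zg * h for zg, t, h in zip(z, temp, prev_hidden)]

# Demo with one input dimension and one hidden unit (hypothetical values).
zeros = [[0.0]]
carry = {
    'update': (zeros, zeros, [100.0]),   # update gate ~1: carry the old state over
    'reset':  (zeros, zeros, [0.0]),
    'hidden': (zeros, zeros, [0.3]),
}
overwrite = dict(carry, update=(zeros, zeros, [-100.0]))  # update gate ~0

kept = gru_step([1.0], [0.8], carry)
replaced = gru_step([1.0], [0.8], overwrite)
print(kept, replaced)
```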
My implementations of these models in Theano, as well as optimizers for training them, can be found in this GitHub repository.
RNN Example: Predicting Dummy Data
Now that we’ve seen the equations, let’s see how Keras implementations compare on some sample data.
import numpy as np
rng = np.random.RandomState(42)

# parameters
input_dims, output_dims = 10, 1
sequence_length = 20
n_test = 10

# generate some random data to train on
get_rand = lambda *shape: np.asarray(rng.rand(*shape) > 0.5, dtype='float32')
X_data = np.asarray([get_rand(sequence_length, input_dims) for _ in range(n_test)])
y_data = np.asarray([get_rand(output_dims,) for _ in range(n_test)])

# put together rnn models
from keras.layers import Input
from keras.layers.recurrent import SimpleRNN, LSTM, GRU
from keras.models import Model

input_sequence = Input(shape=(sequence_length, input_dims,), dtype='float32')

vanilla = SimpleRNN(output_dims, return_sequences=False)(input_sequence)
lstm = LSTM(output_dims, return_sequences=False)(input_sequence)
gru = GRU(output_dims, return_sequences=False)(input_sequence)
rnns = [vanilla, lstm, gru]

# train the models
for rnn in rnns:
    model = Model(input=[input_sequence], output=rnn)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy')
    model.fit([X_data], [y_data], nb_epoch=1000)
The results will vary from trial to trial. RNNs are exceptionally difficult to train. However, in general, a model that can take advantage of long-term dependencies will have a much easier time seeing how two sequences are different.
Attentional RNNs
It isn’t strictly necessary to understand the RNN part before looking at this part, but it will help everything make more sense. The next component of language modeling, which was the focus of the Tan paper, is the Attentional RNN. The essential components of this model are described in “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” (Xu et al. 2016). I’ll try to hash it out a little bit in this blog post and look at how to build it in Keras.
Lambda Layer
First, let’s look at how to make a custom layer in Keras. There are a couple of options. One is the Lambda layer, which does a specified operation. An example of this could be a layer that doubles the value it is passed:
from keras.layers import Lambda, Input
from keras.models import Model
import numpy as np

input = Input(shape=(5,), dtype='int32')
double = Lambda(lambda x: 2 * x)(input)

model = Model(input=[input], output=[double])
model.compile(optimizer='sgd', loss='mse')

# one sample with five features, to match the declared input shape
data = np.arange(5).reshape(1, 5)
print(model.predict(data))
This doubles our input data. Note that there are no trainable weights anywhere in this model, so it couldn’t actually learn anything. What if we wanted to multiply our input vector by some trainable scalar that predicts the output vector? In this case, we will have to write our own layer.
Building a Custom Layer Example
Let’s jump right in and write a layer that learns to multiply an input by a scalar value and produce an output.
from keras.engine import Layer
from keras import initializations

# our layer will take input shape (nb_samples, 1)
class MultiplicationLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('glorot_uniform')
        super(MultiplicationLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # each sample should be a scalar
        assert len(input_shape) == 2 and input_shape[1] == 1
        self.multiplicand = self.init(input_shape[1:], name='multiplicand')

        # let Keras know that we want to train the multiplicand
        self.trainable_weights = [self.multiplicand]

    def get_output_shape_for(self, input_shape):
        # we're doing a scalar multiply, so we don't change the input shape
        assert input_shape and len(input_shape) == 2 and input_shape[1] == 1
        return input_shape

    def call(self, x, mask=None):
        # this is called during MultiplicationLayer()(input)
        return x * self.multiplicand

# test the model
from keras.layers import Input
from keras.models import Model

# input is a single scalar
input = Input(shape=(1,), dtype='int32')
multiply = MultiplicationLayer()(input)

model = Model(input=[input], output=[multiply])
model.compile(optimizer='sgd', loss='mse')

import numpy as np
# reshape to (nb_samples, 1) to match the declared input shape
input_data = np.arange(10).reshape(-1, 1)
output_data = 3 * input_data

model.fit([input_data], [output_data], nb_epoch=10)
print(model.layers[1].multiplicand.get_value()) # should be close to 3
There we go! We have a complete model. We could change it around to make it fancier, like adding a broadcastable dimension to the multiplicand so that the layer could be passed a vector of numbers instead of just a scalar. Let’s look closer at how we built the multiplication layer:
def __init__(self, **kwargs):
    self.init = initializations.get('glorot_uniform')
    super(MultiplicationLayer, self).__init__(**kwargs)
First, we make a weight initializer that we can use later to get weights. glorot_uniform is just a particular way to initialize weights. We then call the __init__ method of the super class.
def build(self, input_shape):
    # each sample should be a scalar
    assert len(input_shape) == 2 and input_shape[1] == 1
    self.multiplicand = self.init(input_shape[1:], name='multiplicand')

    # let Keras know that we want to train the multiplicand
    self.trainable_weights = [self.multiplicand]
This method specifies the components of the model, for when we build it. The only component we need is the scalar to multiply by, so we initialize a new tensor by calling self.init, the initializer we created in the __init__ method.
def get_output_shape_for(self, input_shape):
    # we're doing a scalar multiply, so we don't change the input shape
    assert input_shape and len(input_shape) == 2 and input_shape[1] == 1
    return input_shape
This method tells the builder what the output shape of this layer will be given its input shape. Since our layer just does a scalar multiply, it doesn’t change the output shape from the input shape. For example, scalar multiplying the input [1, 2, 3] of dimensions <3, 1> by a scalar factor of 2 gives the output [2, 4, 6], which has the same dimensions <3, 1>.
def call(self, x, mask=None):
    # this is called during MultiplicationLayer()(input)
    return x * self.multiplicand
This is the bread and butter of the layer, where we actually perform the operation. We specify that the output of this layer is the input x multiplied elementwise by our multiplicand tensor. Note that building the model takes a while, because whatever backend we use (for example, Theano) has to put together the tensors in the right way. To catch shape mistakes quickly, before that slow step, it is good practice to add assert checks in the build and get_output_shape_for methods.
Characterizing the Attentional LSTM
Now that we’ve got an idea of how to build a custom layer, let’s look at the specifications for an attentional LSTM. Following Tan et al., we can augment our LSTM equations from earlier to include an attentional component. The attentional component requires some attention vector attention_vec.
input_gate = sigmoid(dot(input_vector, W_input) + dot(prev_hidden, U_input) + b_input)
forget_gate = sigmoid(dot(input_vector, W_forget) + dot(prev_hidden, U_forget) + b_forget)
output_gate = sigmoid(dot(input_vector, W_output) + dot(prev_hidden, U_output) + b_output)
candidate_state = tanh(dot(input_vector, W_hidden) + dot(prev_hidden, U_hidden) + b_hidden)
memory_unit = prev_memory_unit * forget_gate + candidate_state * input_gate
new_hidden_state = tanh(memory_unit) * output_gate
attention_state = tanh(dot(attention_vec, W_attn) + dot(new_hidden_state, U_attn))
attention_param = exp(dot(attention_state, W_param))
new_hidden_state = new_hidden_state * attention_param
The new equations are the last three, which approximately reproduce equations 9, 10 and 11 from the paper using different notation.
The attention parameter is a function of the current hidden state and the attention vector mixed together. Each is first put through a matrix, then they are summed and put through an activation function to get an attention state, which is then put through another transformation to get an attention parameter. The attention parameter then rescales the hidden state. Supposedly, this is conceptually similar to TF-IDF weighting, where the model learns to weight particular states at particular times.
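The three attention equations can be sketched on their own, separate from the LSTM, in a few lines of plain Python. This is a sketch under assumptions: W_param is taken here to be a vector, so the attention parameter is a single scalar per time step (the Keras code later in this post uses a square matrix instead, which gives one weight per hidden dimension), and the function name and demo values are hypothetical.

```python
import math

def dot(vec, mat):
    """Multiply a vector of size m by an m x n matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def attention_reweight(hidden, attention_vec, W_attn, U_attn, w_param):
    """Rescale one hidden state by an attention parameter.

    attention_state = tanh(dot(attention_vec, W_attn) + dot(hidden, U_attn))
    attention_param = exp(attention_state . w_param)   (scalar, by assumption)
    """
    state = [math.tanh(a + h)
             for a, h in zip(dot(attention_vec, W_attn), dot(hidden, U_attn))]
    param = math.exp(sum(s * w for s, w in zip(state, w_param)))
    return [h * param for h in hidden], param

hidden = [0.5, -0.5]
attention_vec = [1.0]
U_attn = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights the attention parameter is exp(0) = 1: nothing changes.
unchanged, param_one = attention_reweight(hidden, attention_vec,
                                          [[0.0, 0.0]], U_attn, [0.0, 0.0])

# A strongly matching attention vector saturates tanh at 1, so param = exp(1).
boosted, param_e = attention_reweight(hidden, attention_vec,
                                      [[100.0, 100.0]], U_attn, [0.5, 0.5])
print(param_one, param_e)
```

Because the exponential is always positive, the attention parameter can shrink or amplify a hidden state but never flips its sign.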
Building an Attentional LSTM Example
Now that we have all the components for an Attentional LSTM, let’s see the code for how we could implement this. The attentional component can be tacked onto the LSTM code that already exists.
from keras import backend as K
from keras.layers import LSTM

class AttentionLSTM(LSTM):
    def __init__(self, output_dim, attention_vec, **kwargs):
        self.attention_vec = attention_vec
        super(AttentionLSTM, self).__init__(output_dim, **kwargs)

    def build(self, input_shape):
        super(AttentionLSTM, self).build(input_shape)

        assert hasattr(self.attention_vec, '_keras_shape')
        attention_dim = self.attention_vec._keras_shape[1]

        self.U_a = self.inner_init((self.output_dim, self.output_dim),
                                   name='{}_U_a'.format(self.name))
        self.b_a = K.zeros((self.output_dim,), name='{}_b_a'.format(self.name))

        self.U_m = self.inner_init((attention_dim, self.output_dim),
                                   name='{}_U_m'.format(self.name))
        self.b_m = K.zeros((self.output_dim,), name='{}_b_m'.format(self.name))

        self.U_s = self.inner_init((self.output_dim, self.output_dim),
                                   name='{}_U_s'.format(self.name))
        self.b_s = K.zeros((self.output_dim,), name='{}_b_s'.format(self.name))

        self.trainable_weights += [self.U_a, self.U_m, self.U_s,
                                   self.b_a, self.b_m, self.b_s]

        if self.initial_weights is not None:
            self.set_weights(self.initial_weights)
            del self.initial_weights

    def step(self, x, states):
        h, [h, c] = super(AttentionLSTM, self).step(x, states)
        attention = states[4]

        m = K.tanh(K.dot(h, self.U_a) + attention + self.b_a)
        s = K.exp(K.dot(m, self.U_s) + self.b_s)
        h = h * s

        return h, [h, c]

    def get_constants(self, x):
        constants = super(AttentionLSTM, self).get_constants(x)
        constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m)
        return constants
Let’s look at what each function is doing individually. Note that this builds heavily upon the already-existing LSTM implementation.
from keras import backend as K
from keras.layers import LSTM
We will create a subclass (does python even do subclasses?) of the LSTM implementation that Keras already provides. The Keras backend is either Theano or Tensorflow, depending on the settings specified in ~/.keras/keras.json (the default is Theano). This backend lets us use Theano-type functions such as K.zeros, which creates a tensor of zeros, to initialize our model.
def __init__(self, output_dim, attention_vec, **kwargs):
    self.attention_vec = attention_vec
    super(AttentionLSTM, self).__init__(output_dim, **kwargs)
We initialize the layer by passing it the number of hidden units output_dim and the layer to use as the attention vector attention_vec. The __init__ function is identical to the __init__ function for the LSTM layer except for the attention vector, so we just reuse it here.
def build(self, input_shape):
I won’t reproduce everything here, but essentially this method initializes all of the weight matrices we need for the attentional component, after calling the LSTM.build method to initialize the LSTM weight matrices.
def step(self, x, states):
    h, [h, c] = super(AttentionLSTM, self).step(x, states)
    attention = states[4]

    m = K.tanh(K.dot(h, self.U_a) + attention + self.b_a)
    s = K.exp(K.dot(m, self.U_s) + self.b_s)
    h = h * s

    return h, [h, c]
This method is used by the RNN superclass, and tells the layer what to do on each timestep. It mirrors the equations given earlier, and adds the attentional component on top of the LSTM equations.
def get_constants(self, x):
    constants = super(AttentionLSTM, self).get_constants(x)
    constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m)
    return constants
This method is used by the LSTM superclass to define components outside of the step function, so that they don’t need to be recomputed every time step. In our case, the projected attention vector doesn’t need to be recomputed every time step, so we define it as a constant (we then grab it in the step function using attention = states[4]).
Convolutional Neural Networks
I will add something here when I actually understand how these work at a sufficient level. Basically, with language modeling, a common strategy is to apply a ton (on the order of 1000) of convolutional filters to the embedding layer, followed by a 1-max pooling function, and call it a day. It actually works stupidly well for question answering (see Feng et al. for benchmarks). In the meantime, I will dump some code here that might be helpful to pore over.
from keras.layers import Convolution1D, merge

# question and answer are assumed to be the embedded question and answer tensors
cnns = [Convolution1D(filter_length=filt, nb_filter=1000, border_mode='same')
        for filt in [2, 3, 5, 7]]
question = merge([cnn(question) for cnn in cnns], mode='concat')
answer = merge([cnn(answer) for cnn in cnns], mode='concat')
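For intuition about what "convolutional filters followed by 1-max pooling" computes, here is a minimal pure-Python sketch for a single filter. It is an illustration only: it slides the filter over full windows (no padding, unlike border_mode='same' in the Keras snippet), and the function name, toy sentence and filter values are assumptions.

```python
def conv1d_max_pool(embedded_sentence, filt, bias=0.0):
    """Slide one filter over the sentence and keep its largest response.

    embedded_sentence: list of word vectors, each of size d
    filt: list of `width` weight vectors, each of size d
    Assumes the sentence has at least `width` words.
    """
    width = len(filt)
    responses = []
    for start in range(len(embedded_sentence) - width + 1):
        window = embedded_sentence[start:start + width]
        total = bias + sum(w * x
                           for w_vec, x_vec in zip(filt, window)
                           for w, x in zip(w_vec, x_vec))
        responses.append(total)
    # 1-max pooling: one number per filter, regardless of sentence length
    return max(responses)

# Toy 2-dimensional "embeddings" for a four-word sentence (made-up values).
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]

# A width-2 filter that responds to "dimension 0, then dimension 1".
filt = [[1, 0], [0, 1]]

feature = conv1d_max_pool(sentence, filt)
print(feature)
```

Running many such filters of different widths and concatenating their pooled outputs yields a fixed-size sentence vector, which is what makes the question and answer representations directly comparable.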
Similarity Metrics
The basic idea with question answering is to embed questions and answers as vectors, so that the question vector is close in vector space to the answer vector. For example, with the Attentional RNN, we take the question vector and use it as an input for generating the answer vector. A common approach is to then rank answer vectors according to their cosine similarities with the question vector. This doesn’t follow the conventional neural network architecture, and takes some manipulation to achieve in Keras. To use equations, what we would like to do is:
best answer = argmax(cos(question, answers))
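Outside of any framework, this ranking step can be sketched in plain Python as follows. The vectors are made up for illustration and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_answers(question, answers):
    """Return answer indices sorted from most to least similar to the question."""
    scores = [cosine(question, a) for a in answers]
    return sorted(range(len(answers)), key=lambda i: scores[i], reverse=True)

question = [1.0, 0.0]
answers = [
    [0.0, 1.0],    # orthogonal to the question
    [2.0, 0.1],    # nearly the same direction, different length
    [-1.0, 0.0],   # opposite direction
]
ranking = rank_answers(question, answers)
best_answer = ranking[0]
print(ranking)
```

Note that the second answer wins even though its magnitude differs from the question's, since cosine similarity only compares directions.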
Training is generally done by minimizing hinge loss. In this case, we want the cosine similarity for the correct answer to go up, and the cosine similarity for an incorrect answer to go down. The loss function can be formulated as:
loss = max(0, constant margin - cos(question, good answer) + cos(question, bad answer))
Note that for implementations, having a loss of exactly zero can be troublesome, so a small value like 1e-6 is generally used as the floor instead. The loss hits that floor when the cosine similarity of the good answer exceeds that of the bad answer by at least the constant margin we defined. In practice, the margins generally range from 0.001 to 0.2. If we want to use something besides cosine similarity, we can reformulate this as
loss = max(0, constant margin - sim(question, good answer) + sim(question, bad answer))
where sim is our similarity metric. Hinge loss works well for this application, as opposed to something like mean squared error, because we don’t need our question vectors to be orthogonal to the bad answer vectors; we just want the bad answer vectors to be a good distance away.
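Written as a plain Python function on two precomputed similarity scores, the loss looks like this. It is a sketch: the default margin of 0.2 is simply the upper end of the range mentioned above, and the function name is hypothetical.

```python
def hinge_loss(sim_good, sim_bad, margin=0.2):
    """max(0, margin - sim(question, good answer) + sim(question, bad answer))."""
    return max(0.0, margin - sim_good + sim_bad)

# The good answer already beats the bad one by more than the margin: no loss.
satisfied = hinge_loss(0.9, 0.1)

# The good answer is only slightly ahead, so the model is still penalised.
violated = hinge_loss(0.5, 0.45)
print(satisfied, violated)
```

Once a (good, bad) pair clears the margin it contributes nothing to the gradient, so training effort concentrates on the pairs the model still confuses.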
Cosine Similarity Example: Rotational Matrix
First, let’s look at how to do cosine similarity within the constraints of Keras. Fortunately, Keras has an implementation of cosine similarity, as a mode argument to the merge layer. This is done with:
from keras.layers import merge

cosine_sim = merge([a, b], mode='cos', dot_axes=1)
If we pass it two inputs of dimensions (a, b, c), it will calculate the cosine similarity of the c dimension (specified using dot_axes) and give an output of dimensions (a, b). However, because we might eventually want to implement other types of similarities besides cosine similarity, let’s look at how this can be done by passing a function to merge.
from keras import backend as K

def similarity(x):
    dot = lambda a, b: K.sum(a * b, axis=-1, keepdims=True)
    return dot(x[0], x[1]) / K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1]))

cosine_sim = merge([a, b], mode=similarity, output_shape=lambda shapes: (shapes[0][0], 1))
We define a function similarity which we will use to compute the similarity of the inputs passed to the merge layer. Note that when we do this, we also have to pass an output_shape which tells Keras what shape the output will be after we do this operation (hopefully in the future this shape will be inferred, but it is still an open issue in the GitHub group).
A cool example might be to see if we can learn a rotation matrix. A rotation matrix in Euclidean space is a matrix which rotates a vector by a certain angle around the origin. It is defined as a function of theta, the angle to rotate by: R(theta) = [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]].
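As a concrete reference point, the standard 2-D rotation matrix can be built and applied in plain Python. This sketch multiplies a row vector by the matrix (vec · R), which is the same convention a Dense layer uses for its weights; the function names are hypothetical.

```python
import math

def rotation_matrix(theta):
    """Standard 2-D rotation matrix for an angle theta in radians."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def row_times_matrix(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

R = rotation_matrix(math.pi / 2)  # 90 degrees

# Under the row-vector convention, (0, 1) maps to (1, 0) and (1, 0) to (0, -1).
up_rotated = row_times_matrix([0.0, 1.0], R)
right_rotated = row_times_matrix([1.0, 0.0], R)
print(up_rotated, right_rotated)
```

Up to floating-point error, R here is [[0, -1], [1, 0]], which is the target a single linear layer would need to learn to reproduce a 90 degree rotation.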
We can learn this matrix really simply with the right dataset and one dense layer, that is:
from keras.layers import Input, Dense
from keras.models import Model

a = Input(shape=(2,), name='a')

a_rotated = Dense(2, activation='linear')(a)

model = Model(input=[a], output=[a_rotated])
model.compile(optimizer='sgd', loss='mse')

import numpy as np

# each row of b_data is the matching row of a_data rotated by 90 degrees
a_data = np.asarray([[0, 1], [1, 0], [0, -1], [-1, 0]])
b_data = np.asarray([[1, 0], [0, -1], [-1, 0], [0, 1]])

model.fit([a_data], [b_data], nb_epoch=1000)
print(model.layers[1].W.get_value())
A Dense layer with linear activation is the same as a matrix multiplication (plus a bias term). We give it two input dimensions and two output dimensions. After training this model, the printed weight matrix is:
[[0.00603954, -0.99370766]
 [ 0.99173903, 0.0078686 ]]
which is close to the rotation matrix for an angle of 90 degrees. Let’s try this again, but with cosine similarity. This will require some manipulation. In the previous example, we had a clearly defined input, a, and output, b, and our model was designed to perform a transformation on a to predict b. In this example, we have two inputs, a and b, and we will perform a transformation on a to make it close to b. As an output, we get the similarity of the two vectors, so we need to train our model to make this similarity high by providing it a bunch of 1’s.
from keras.layers import Input, Dense, merge
from keras.models import Model
from keras import backend as K

a = Input(shape=(2,), name='a')
b = Input(shape=(2,), name='b')

a_rotated = Dense(2, activation='linear')(a)

def cosine(x):
    axis = len(x[0]._keras_shape) - 1
    dot = lambda a, b: K.batch_dot(a, b, axes=axis)
    return dot(x[0], x[1]) / K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1]))

# one similarity value per sample
cosine_sim = merge([a_rotated, b], mode=cosine, output_shape=lambda x: (x[0][0], 1))

model = Model(input=[a, b], output=[cosine_sim])
model.compile(optimizer='sgd', loss='mse')

import numpy as np

a_data = np.asarray([[0, 1], [1, 0], [0, -1], [-1, 0]])
b_data = np.asarray([[1, 0], [0, -1], [-1, 0], [0, 1]])
targets = np.asarray([1, 1, 1, 1])

model.fit([a_data, b_data], [targets], nb_epoch=1000)
print(model.layers[2].W.get_value())
Running this, we end up with a weight matrix that looks like
[[0.16537911 -1.26961863]
 [ 1.06261277 0.1144496 ]]
This looks a bit like the rotation matrix, but the scaling seems off. Cosine similarity is indifferent to the magnitude of vectors, so the weight matrix ends up not being a rotation matrix so much as a rotation-and-skew matrix. It is interesting to think about why this network learned this particular matrix.
Below, a unit square (blue) is multiplied by the first matrix to get the orange square, and by the second matrix to get the yellow square.
Other Similarity Metrics
Feng et al. provided a list of similarities along with their benchmarks for a CNN architecture. Some of these similarities, along with their implementations in Keras, are reproduced below. They rely on these helper functions:
from keras import backend as K

axis = lambda a: len(a._keras_shape) - 1
dot = lambda a, b: K.batch_dot(a, b, axes=axis(a))
l2_norm = lambda a, b: K.sqrt(K.sum((a - b) ** 2, axis=axis(a), keepdims=True))
If the function requires extra parameters, they are supplied in a dictionary called params.
Cosine
def cosine(x):
    return dot(x[0], x[1]) / K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1]))
Polynomial
def polynomial(x):
    return (params['gamma'] * dot(x[0], x[1]) + params['c']) ** params['d']
Values for gamma used in the paper were [0.5, 1.0, 1.5]. The value for c was usually 1. Values for d were [2, 3].
Sigmoid
def sigmoid(x):
    return K.tanh(params['gamma'] * dot(x[0], x[1]) + params['c'])
Values for gamma used in the paper were [0.5, 1.0, 1.5], and c was 1.
RBF
RBF stands for radial basis function.
def rbf(x):
    return K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1]) ** 2)
Values for gamma used in the paper were [0.5, 1.0, 1.5].
Euclidean
def euclidean(x):
    return 1 / (1 + l2_norm(x[0], x[1]))
Exponential
def exponential(x):
    return K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1]))
GESD
This was a custom metric developed by the authors which stands for Geometric mean of Euclidean and Sigmoid Dot product. It performed well for their benchmarks.
def gesd(x):
    euclidean = 1 / (1 + l2_norm(x[0], x[1]))
    sigmoid = 1 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c'])))
    return euclidean * sigmoid
Values for gamma used were [0.5, 1.0, 1.5] and c was 1.
AESD
This was a custom metric developed by the authors which stands for Arithmetic mean of Euclidean and Sigmoid Dot product. It performed well for their benchmarks.
def aesd(x):
    euclidean = 0.5 / (1 + l2_norm(x[0], x[1]))
    sigmoid = 0.5 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c'])))
    return euclidean + sigmoid
Values for gamma used were [0.5, 1.0, 1.5] and c was 1.
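To get a feel for how GESD and AESD behave, the same formulas can be evaluated on plain Python lists without Keras. This is a sketch for experimentation only; the function names and the example vectors are assumptions, and gamma and c default to 1.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean_sim(u, v):
    """1 / (1 + distance): equals 1 for identical vectors, decays with distance."""
    return 1.0 / (1.0 + l2_distance(u, v))

def sigmoid_dot_sim(u, v, gamma=1.0, c=1.0):
    """Logistic function of the shifted, scaled dot product."""
    return 1.0 / (1.0 + math.exp(-gamma * (dot(u, v) + c)))

def gesd(u, v, gamma=1.0, c=1.0):
    """Product of the Euclidean and sigmoid-dot terms."""
    return euclidean_sim(u, v) * sigmoid_dot_sim(u, v, gamma, c)

def aesd(u, v, gamma=1.0, c=1.0):
    """Equal-weight average of the Euclidean and sigmoid-dot terms."""
    return 0.5 * euclidean_sim(u, v) + 0.5 * sigmoid_dot_sim(u, v, gamma, c)

same = [1.0, 0.0]
far = [-3.0, 4.0]
print(gesd(same, same), aesd(same, same))
print(gesd(same, far), aesd(same, far))
```

Because GESD multiplies its two terms, a pair of vectors scores highly only when they are both close in distance and aligned in direction, whereas AESD lets one term compensate for the other.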
InsuranceQA Model Example
A surprisingly good model for the InsuranceQA dataset is as follows:
A framework for designing and testing models can be found in this GitHub repo. This model achieved relatively good marks for Top-1 Accuracy (how often the model ranked a ground-truth answer highest out of 500 candidates) and Mean Reciprocal Rank (MRR), which is defined as the average, over all questions, of 1 / rank, where rank is the position of the highest-ranked ground-truth answer.
The results after learning the training set are summarized in the following table.
Test set | Top-1 Accuracy | Mean Reciprocal Rank
--- | --- | ---
Test 1 | 0.4933 | 0.6189
Test 2 | 0.4606 | 0.5968
Dev | 0.4700 | 0.6088
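The two metrics in the table can be computed from a list of ranks, one per question, where each rank is the 1-based position of the best-placed ground-truth answer. The sketch below uses made-up ranks rather than real InsuranceQA results, and the function names are hypothetical.

```python
def top1_accuracy(ranks):
    """Fraction of questions whose best ground-truth answer is ranked first."""
    return sum(1 for r in ranks if r == 1) / float(len(ranks))

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all questions."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks for four questions (not real results).
ranks = [1, 2, 1, 4]
print(top1_accuracy(ranks))
print(mean_reciprocal_rank(ranks))
```

MRR is always at least as high as Top-1 Accuracy, because a question answered at rank 2 or 3 still earns partial credit.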
For comparison, the best model from Feng et al. achieved an accuracy of 0.653 on Test 1, and the model in Tan et al. achieved an accuracy of 0.681 on Test 1. This model isn’t exceptional, but it works pretty well for how simple it is. It outperforms the baseline bag-of-words model, and performs on par with the Metzler-Bendersky IR model introduced in “Learning concept importance using a weighted dependence model” (Bendersky and Metzler, 2010). Here’s how we build it in Keras:
from keras import backend as K
from keras.layers import Input, Embedding, Dropout, Lambda, Activation, merge
from keras.models import Model

def build():
    input = Input(shape=(sentence_length,), dtype='int32')

    # embedding
    embedding = Embedding(n_words, n_embed_dims)
    input_embedding = embedding(input)

    # dropout
    dropout = Dropout(0.5)
    input_dropout = dropout(input_embedding)

    # maxpooling
    maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False),
                     output_shape=lambda x: (x[0], x[2]))
    input_pool = maxpool(input_dropout)

    # activation
    activation = Activation('tanh')
    output = activation(input_pool)

    return Model(input=[input], output=[output])

# the same sub-model (and so the same weights) encodes questions and answers;
# this assumes q_len == a_len == sentence_length
model = build()
question = Input(shape=(q_len,), dtype='int32')
answer = Input(shape=(a_len,), dtype='int32')

question_output = model(question)
answer_output = model(answer)

similarity = merge([question_output, answer_output], mode='cos', dot_axes=1)
qa_model = Model(input=[question, answer], output=[similarity])
The code is kind of awkward without the context, so I would recommend checking out the repository to see how it works.
Closing Remarks
Hopefully this demonstrates that Keras is powerful and flexible enough to allow for quick and creative implementations of networks. This post follows my final project for my Information Retrieval class, the code for which can be seen here. I think this code makes more sense in the context of this post.