神刀安全网

Deep learning based solution for Netflix challenge beats prize winning entry

If you are familiar with the Netflix prize challenge, you would remember the final solution that got the 1M prize.. It was a mix of solutions from a few winning teams and probably was not the most elegant of solutions.  There were some reports that Netflix did not use the final solution.

Here is a much simpler Neural network based solution that beats the top result on a validation set carved from the original dataset. and should not take more than 3 hours to run and few minutes to code.

Code for the model as implemented in Keras

movie_count = 17771 user_count = 2649430 model_left = Sequential() model_left.add(Embedding(movie_count, 60, input_length=1)) model_right = Sequential() model_right.add(Embedding(user_count, 20, input_length=1)) model = Sequential() model.add(Merge([model_left, model_right], mode='concat')) model.add(Flatten()) model.add(Dense(64)) model.add(Activation('sigmoid')) model.add(Dense(64)) model.add(Activation('sigmoid')) model.add(Dense(64)) model.add(Activation('sigmoid')) model.add(Dense(1)) model.compile(loss='mean_squared_error', optimizer='adadelta')

In short there are 2 embeddings a 60 dimensional embedding for movies and a 20 dimensional embedding for each of the users. The embeddings are concatenated to form a single 80 dimensional input vector. The network learns a mapping between this embedding vector and the rating.  The model is fit as:

model.fit([tr[:,0].reshape((L,1)), tr[:,1].reshape((L,1))], tr[:,2].reshape((L,1)), batch_size=24000, nb_epoch=42, validation_data=([ ts[:,0].reshape((M,1)), ts[:,0].reshape((M,1))], ts[:,2].reshape((M,1))))

In the above code, tr is the training data, which are triples of movie_id, user_id, rating. L is 90432456 and M is 10048051, (around 90%, 10%) split for training and validation data. Another thing to note is that the data set is randomly shuffled before split and the training is done on the random set.

After about 40 epochs you should see the validation error around: 0.7006 or a root mean squared error of 0.837.  This is around 2% better than the million dollar prize winning error rate of 0.8553.  

Epoch 39/64 90432456/90432456 [==============================] - 1183s - loss: 0.6475 - val_loss: 0.7006

This model does not consider the rating date. The results can probably improved by using all available data.

转载本站任何文章请注明:转载至神刀安全网,谢谢神刀安全网 » Deep learning based solution for Netflix challenge beats prize winning entry

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
分享按钮