**Deep matrix factorization using Apache MXNet**(notes from Oreilly, github notebook)

Recommendation engines are widely used models that attempt to identify items that a person will like based on that person’s past behavior. We’re all familiar with Amazon’s recommendations based on your past purchasing history, and Netflix recommending shows to you based on your history and the ratings you’ve given other shows.

The Netflix Prize is likely the most famous example many data scientists explored in detail for matrix factorization. Simply put, the challenge was to predict a viewer’s rating of shows that they hadn’t yet watched based on their ratings of shows that they had watched in the past, as well as the ratings that other viewers had given a show. This takes the form of a matrix where the rows are users, the columns are shows, and the values in the matrix correspond to the rating of the show. Most of the values in this matrix are missing, as users have not rated (or seen) the majority of shows.

The best techniques for filling in these missing values can vary depending on how sparse the matrix is. If there are not many missing values, frequently t

**he mean or median value**of the observed values can be used to fill in the missing values. If the data is categorical, as opposed to numeric, the m

**ost frequently observed value**is used to fill in the missing values.

This technique, while simple and fast, is fairly naive, as it misses the interactions that can happen between the variables in a data set. Most egregiously, if the data is numeric and bimodal between two distant values, the mean imputation method can create values that otherwise would

**not exist in the data set**at all. These improper imputed values increase the noise in the data set the sparser it is, so this becomes an increasingly bad strategy as sparsity increases.

Matrix factorization is a simple idea that tries to learn connections between the known values in order to impute the missing values in a smart fashion. Simply put, for each row and for each column, it learns (k) numeric “factors” that represent the row or column.

This would line up the user factor “does this user like comedies?” with the movie factor “is this a comedy?” so we can predict the user like that movie given both are answered "yes". In the general formula:

user_i_rating_movie_j = sum of (user_i_k * movie_j,k) for all possible factor of k.

If it's residential properties in the US, those k factors would be something like:

1. The user i likes/visited a lot the houses with big backyard, and the house j has a big backyeard.

2. the user i likes/visited a lot the houses with downstairs bedroom, and the house j has downstairs bedroom.

... It is impractical to always pre-assign meaning to these factors at large scale; rather, we need to learn the values directly from data. Once these factors are learned using the observational data, a simple dot product between the matrices can allow one to fill in all of the missing values. If two movies are similar to each other, they will likely have similar factors, and their corresponding matrix columns will have similar values. The upside to this approach is that an informative set of factors can be learned automatically from data, but the downside is that the meaning of these factors might not be easily interpretable.

**Matrix factorization is a linear method**, meaning that if there are complicated non-linear interactions going on in the data set, a simple dot product may not be able to handle it well. Given the recent success of deep learning in complicated non-linear computer vision and natural language processing tasks, it is natural to want to find a way to incorporate it into matrix factorization as well.

A way to do this is called “

**deep matrix factorization**” and involves the replacement of the dot product with a neural network that is trained jointly with the factors. This makes the model more powerful because a neural network can model

**important non-linear combinations**of factors to make better predictions.

In traditional matrix factorization, the prediction is the simple dot product between the factors for each of the dimensions. In contrast, in deep matrix factorization the factors for both are concatenated together and used as the input to a neural network whose output is the prediction. The parameters in the neural network are then trained jointly with the factors to produce a sophisticated non-linear model for matrix factorization.

This tutorial will first introduce both traditional and deep matrix factorization by using them to model synthetic data.

%pylab inline import mxnet as mx import pandas import seaborn; seaborn.set_style('whitegrid') import logging logger = logging.getLogger() logger.setLevel(logging.DEBUG) X = numpy.random.randn(250, 250) n = 35000 i = numpy.random.randint(250, size=n) # Generate random row indexes j = numpy.random.randint(250, size=n) # Generate random column indexes X_values = X[i, j] # Extract those values from the matrix print(X_values.shape) plt.title("Distribution of Data in Matrix", fontsize=16) plt.ylabel("Count", fontsize=14) plt.xlabel("Value", fontsize=14) plt.hist(X_values, bins=100) plt.show()A core component of the matrix factorization model will be the embedding layer. This layer takes in an integer and outputs a dense array of learned features. This is widely used in natural language processing, where the input would be words and the output might be the features related to those words’ sentiment. The strength of this layer is the ability for the network to learn what useful features are instead of needing them to be predefined.

user = mx.symbol.Variable("user") # Name one of our input variables, the user index # Define an embedding layer that takes in a user index and outputs a dense 25 dimensional vector user = mx.symbol.Embedding(data=user, input_dim=250, output_dim=25) movie = mx.symbol.Variable("movie") # Name the other input variable, the movie index # Define an embedding layer that takes in a movie index and outputs a dense 25 dimensional vector movie = mx.symbol.Embedding(data=movie, input_dim=250, output_dim=25) # Name the output variable. "softmax_label" is the default name for outputs in mxnet y_true = mx.symbol.Variable("softmax_label") # Define the dot product between the two variables, which is the elementwise multiplication and a sum y_pred = mx.symbol.sum_axis(data=(user * movie), axis=1) y_pred = mx.symbol.flatten(y_pred) # The linear regression output defines the loss we want to use to train the network, mse in this case y_pred = mx.symbol.LinearRegressionOutput(data=y_pred, label=y_true)We just implemented vanilla matrix factorization! It is a fairly simple model when expressed this way. All we need to do is define an embedding layer for the users and one for the movies, then define the dot product between these two layers as the prediction. Keep in mind that while we’ve used MXNet to implement this model, it does not yet utilize a deep neural network.

Since we have multiple inputs (the movie and the user), the NDArrayIter object is the most convenient as it can handle arbitrary inputs and outputs through the use of dictionaries.

# Build a data iterator for training data using the first 25,000 samples X_train = mx.io.NDArrayIter({'user': i[:25000], 'movie': j[:25000]}, label=X_values[:25000], batch_size=1000) # Build a data iterator for evaluation data using the last 10,000 samples X_eval = mx.io.NDArrayIter({'user': i[25000:], 'movie': j[25000:]}, label=X_values[25000:], batch_size=1000)We aren’t quite done yet. We’ve only defined the symbolic architecture of the network. We now need to specify that the architecture is a model to be trained using the Module wrapper. You can use multiple GPUs by passing in a list such as [mx.gpu(0), mx.gpu(1)]. Given that the examples we are dealing with in this tutorial are fairly simple and don’t involve convolutional layers, it likely won’t be very beneficial to use multiple GPUs.

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, eval_metric='rmse', eval_data=X_eval)Let’s now create some synthetic data that has a structure that can be exploited instead of evenly distributed random values. We can do this by first randomly generating two lower-rank matrices, one for the rows and one for the columns, and taking the dot product between them.

a = numpy.random.normal(0, 1, size=(250, 25)) # Generate random numbers for the first skinny matrix b = numpy.random.normal(0, 1, size=(25, 250)) # Generate random numbers for the second skinny matrix X = a.dot(b) # Build our real data matrix from the dot product of the two skinny matrices n = 35000 i = numpy.random.randint(250, size=n) j = numpy.random.randint(250, size=n) X_values = X[i, j] X_train = mx.io.NDArrayIter({'user': i[:25000], 'movie': j[:25000]}, label=X_values[:25000], batch_size=100) X_eval = mx.io.NDArrayIter({'user': i[25000:], 'movie': j[25000:]}, label=X_values[25000:], batch_size=100) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) #model.fit(X_train, num_epoch=5, eval_metric='mse', eval_data=X_eval) model.fit(X_train, num_epoch=5,It doesn’t seem like we’re learning anything again because neither the training nor the validation mse goes down epoch by epoch. The problem this time is that, like many matrix factorization algorithms, we are using stochastic gradient descent (SGD), which is tricky to set an appropriate learning rate for. However, since we’ve implemented matrix factorization using Apache MXNet we can easily use a different optimizer. Adam is a popular optimizer that can automatically tune the learning rate to get better results, and we can specify that we want to use it by adding in optimizer='adam' to the fit function.optimizer='adam', eval_metric='mse', eval_data=X_eval)

Let’s now take a look at what deep matrix factorization would look like. Essentially, we want to replace the dot product that turns the factors into a prediction with a neural network that takes in the factors as input and predicts the output. We can modify the model code fairly simply to achieve this.

user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=250, output_dim=25) movie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=250, output_dim=25) y_true = mx.symbol.Variable("softmax_label") # No longer using a dot product #y_pred = mx.symbol.sum_axis(data=(user * movie), axis=1) #y_pred = mx.symbol.flatten(y_pred) #y_pred = mx.symbol.LinearRegressionOutput(data=y_pred, label=y_true) # Instead of taking the dot product we want to concatenate the inputs together nn = mx.symbol.concat(user, movie) nn = mx.symbol.flatten(nn) # Now we pass both the movie and user factors into a one layer neural network nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) # We use a ReLU activation function here, but one could use any type including PReLU or sigmoid nn = mx.symbol.Activation(data=nn, act_type='relu') # Now we define our output layer, a dense layer with a single neuron containing the prediction nn = mx.symbol.FullyConnected(data=nn, num_hidden=1) # We don't put an activation on the prediction here because it is the output of the model y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='mse', eval_data=X_eval)The only difference to our code is that we have to define a neural network. The inputs to this network are the concatenated factors from the embedding layer, created using mx.symbol.concat. Next, we flatten the result just to ensure the right shape using mx.symbol.flatten, and then use mx.symbol.FullyConnected and mx.symbol.Activation layers as necessary to define the network. The final fully connected layer must have a single node, as this is the final prediction that the network is making.

Now let’s move on to a real example using the MovieLens 20M data set. This data is comprised of movie ratings from the MovieLens site (https://movielens.org/), a site that will predict what other movies you will like after seeing you rate movies. The MovieLens 20M data set is a sampling of about 20 million ratings from about 138,000 users on about 27,000 movies. The ratings range from 0.5 to 5 stars in 0.5 star increments. (At least 6GB of RAM is suggested for running the cells.)

import os import urllib.request import zipfile # If we don't have the data yet, download it from the appropriate source if not os.path.exists('ml-20m.zip'): urllib.request.urlretrieve('http://files.grouplens.org/datasets/movielens/ml-20m.zip', 'ml-20m.zip') # Now extract the data since we know we have it at this point with zipfile.ZipFile("ml-20m.zip", "r") as f: f.extractall("./") # Now load it up using a pandas dataframe data = pandas.read_csv('./ml-20m/ratings.csv', sep=',', usecols=(0, 1, 2)) data.head()Let’s next quickly look at the users and movies. Specifically, let’s look at the maximum and minimum ID values, and the number of unique users and movies.

print("user id min/max: ", data['userId'].min(), data['userId'].max()) print("# unique users: ", numpy.unique(data['userId']).shape[0]) print("") print("movie id min/max: ", data['movieId'].min(), data['movieId'].max()) print("# unique movies: ", numpy.unique(data['movieId']).shape[0])It looks like the max user ID is equal to the number of unique users, but this is not the case for the number of movies.

We can quickly estimate the sparsity of the MovieLens 20M data set using these numbers. If there are ~138K unique users and ~27K unique movies, then there are ~3.7 billion entries in the matrix. Since only ~20M of these are present, ~99.5% of the matrix is missing.

n = 19000000 data = data.sample(frac=1).reset_index(drop=True) # Shuffle the data in place row-wise # Use the first 19M samples to train the model train_users = data['userId'].values[:n] - 1 # Offset by 1 so that the IDs start at 0 train_movies = data['movieId'].values[:n] - 1 # Offset by 1 so that the IDs start at 0 train_ratings = data['rating'].values[:n] # Use the remaining ~1M samples for validation of the model valid_users = data['userId'].values[n:] - 1 # Offset by 1 so that the IDs start at 0 valid_movies = data['movieId'].values[n:] - 1 # Offset by 1 so that the IDs start at 0 valid_ratings = data['rating'].values[n:] X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, label=train_ratings, batch_size=batch_size) X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, label=valid_ratings, batch_size=batch_size) user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25) movie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25) y_true = mx.symbol.Variable("softmax_label") y_pred = mx.symbol.sum_axis(data=(user * movie), axis=1) y_pred = mx.symbol.flatten(y_pred) y_pred = mx.symbol.LinearRegressionOutput(data=y_pred, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))It looks like we’re learning something on this data set! We can see that both the training and the validation RMSE decrease epoch by epoch. Each epoch takes significantly longer than with the synthetic data because we’re training on 19 million examples instead of 25,000.

Let’s now turn to deep matrix factorization. We’ll use similar code as before, where we concatenate together the two embedding layers and use that as input to two fully connected layers. This network will have the same sized embedding layers as the normal matrix factorization and optimization parameters, meaning the only change is that a neural network is doing the prediction instead of a dot product.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, label=train_ratings, batch_size=batch_size) X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, label=valid_ratings, batch_size=batch_size) user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25) movie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25) y_true = mx.symbol.Variable("softmax_label") nn = mx.symbol.concat(user, movie) nn = mx.symbol.flatten(nn) # Since we are using a two layer neural network here, we will create two FullyConnected layers # with activation functions before the output layer nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=1) y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))It is important to keep in mind we are training a neural network, and so all of the advances in deep learning can be immediately applied to deep matrix factorization. A widely used advance was that of batch normalization, which essentially tried to shrink the range of values that the internal nodes in a network, ultimately speeding up training. We can use batch normalization layers in our network here as well simply by modifying the network to add these layers in between the fully connected layers and the activation layers.

It seems that for this specific data set, batch normalization does cause the network to converge faster, but does not produce a significantly better model.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, label=train_ratings, batch_size=batch_size) X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, label=valid_ratings, batch_size=batch_size) user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25) movie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25) y_true = mx.symbol.Variable("softmax_label") nn = mx.symbol.concat(user, movie) nn = mx.symbol.flatten(nn) nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.An alternate approach to this model is to consider the problem as a classification problem instead of a regression problem. In the classification approach, the classes are the various ratings, and one attempts a 10 class problem, since there are 10 possible ratings that a film can get. The only modification to the data needed is to modify the ratings to be integers between 0 and 9 instead of 0 to 5 with 0.5 star spacing.BatchNorm(data=nn) # First batch norm layer here, before the activaton! nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.BatchNorm(data=nn) # Second batch norm layer here, before the activation! nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=1) y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))

The network needs only be modified to have 10 hidden units in the final layer, since it is now a 10 class problem, and to use the softmax output designed for classification problems instead of the linear regression output designed for regression problems. Lastly, since this is a classification problem, we want to use accuracy as the evaluation metric instead of RMSE.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, label=train_ratings*2-1, batch_size=batch_size) X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, label=valid_ratings*2-1, batch_size=batch_size) user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25) movie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25) y_true = mx.symbol.Variable("softmax_label") nn = mx.symbol.concat(user, movie) nn = mx.symbol.flatten(nn) nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=10) # 10 hidden units because 10 classes #y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true) # SoftmaxOutput instead of LinearRegressionOutput because this is a classification problem now # and we want to use a classification loss function instead of a regression loss function y_pred = mx.symbol.SoftmaxOutput(data=nn, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='acc', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))

**Structural regularization in deep matrix factorization**

Another aspect of deep matrix factorization’s flexibility over that of vanilla matrix factorization is the output_dim size of the embedding layers. For vanilla matrix factorization, these values must be the same for both inputs because the prediction is the dot product between the two. However, if these serve as the input for a neural network, this restriction goes away. This is useful in the case where one dimension may be significantly larger than the other and, thus, requires training a massive number of factors. In the MovieLens case, there are significantly more users (~138K) than there are movies (~27K). By changing the number of user factors from 25 to 15, we can reduce the number of parameters by 1.38 million while not losing any expressivity on the movie side. The only change is changing the value of output_dim in the user embedding layer.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, label=train_ratings, batch_size=batch_size) X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, label=valid_ratings, batch_size=batch_size) user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=15) #It seems we are getting similar accuracy with drastically fewer parameters. This is useful in a setting where you may run out of memory due to the size of the matrix being completed, or as a way to reduce overfitting by using a simpler model.Using 15 instead of 25 heremovie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25) y_true = mx.symbol.Variable("softmax_label") nn = mx.symbol.concat(user, movie) nn = mx.symbol.flatten(nn) nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=1) y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))

Next, we can extend matrix factorization using only two embedding layers. The MovieLens data set comes with genres for each of the films. For instance, the movie "Waiting to Exhale (1995)", the corresponding genres is: "Comedy|Drama|Romance".

or simplicity, let’s only use the first genre of the many that are specified, and determine a unique ID for each of the genres.

labels_str = [label.split("|")[0] for label in genres['genres']] label_set = numpy.unique(labels_str) label_idxs = {l: i for i, l in enumerate(label_set)} label_idxs labels = numpy.empty(n_movies) for movieId, label in zip(genres['movieId'], labels_str): labels[movieId-1] = label_idxs[label] train_genres = numpy.array([labels[int(j)] for j in train_movies]) valid_genres = numpy.array([labels[int(j)] for j in valid_movies]) train_genres[:10] mx.io.NDArrayIter({'user': train_users, 'movie': train_movies, 'movie_genre': train_genres}, label=train_ratings, batch_size=batch_size) X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies, 'movie_genre': valid_genres}, label=valid_ratings, batch_size=batch_size) user = mx.symbol.Variable("user") user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=15) movie = mx.symbol.Variable("movie") movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=20) # Reduce from 25 to 20 # We need to add in a third embedding layer for genreWhile it doesn’t appear that using only the first genre has led to much improvement on this data set, it demonstrates the types of things that one could do with the flexibility afforded by deep matrix factorization.movie_genre = mx.symbol.Variable("movie_genre") movie_genre = mx.symbol.Embedding(data=movie_genre, input_dim=20, output_dim=5) # Set to 5y_true = mx.symbol.Variable("softmax_label") nn = mx.symbol.concat(user, movie, movie_genre) nn = mx.symbol.flatten(nn) nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=64) nn = mx.symbol.Activation(data=nn, act_type='relu') nn = mx.symbol.FullyConnected(data=nn, num_hidden=1) y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true) model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie', 'movie_genre'), symbol=y_pred) model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))