Sunday, September 19, 2021

Data Science Study Notes: recommendation engine notes 1: Deep matrix factorization using Apache MXNet

Deep matrix factorization using Apache MXNet (notes from O'Reilly, GitHub notebook)

Recommendation engines are widely used models that attempt to identify items that a person will like based on that person’s past behavior. We’re all familiar with Amazon’s recommendations based on your past purchasing history, and Netflix recommending shows to you based on your history and the ratings you’ve given other shows.

The Netflix Prize is likely the most famous example of matrix factorization that data scientists have explored in detail. Simply put, the challenge was to predict a viewer’s rating of shows that they hadn’t yet watched, based on their ratings of shows that they had watched in the past, as well as the ratings that other viewers had given a show. This takes the form of a matrix where the rows are users, the columns are shows, and the values in the matrix correspond to the ratings of the shows. Most of the values in this matrix are missing, as users have not rated (or seen) the majority of shows.

The best techniques for filling in these missing values can vary depending on how sparse the matrix is. If there are not many missing values, frequently the mean or median value of the observed values can be used to fill in the missing values. If the data is categorical, as opposed to numeric, the most frequently observed value is used to fill in the missing values.

This technique, while simple and fast, is fairly naive, as it misses the interactions that can happen between the variables in a data set. Most egregiously, if the data is numeric and bimodal between two distant values, the mean imputation method can create values that otherwise would not exist in the data set at all. These improperly imputed values add noise to the data set, and the sparser the matrix is, the more of them there are, so this becomes an increasingly bad strategy as sparsity increases.
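As a quick illustration of that failure mode, here is a minimal sketch with made-up numbers showing how mean imputation on a bimodal column invents a value that never occurs in the data:

import numpy as np

# A bimodal column: observed values cluster around 0 and around 10
observed = np.array([0.1, -0.2, 0.0, 9.8, 10.1, 10.0])

# Mean imputation would fill every missing entry with ~4.97, a value that
# never appears in the data and sits between the two modes
imputed_value = observed.mean()
print(imputed_value)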

Matrix factorization is a simple idea that tries to learn connections between the known values in order to impute the missing values in a smart fashion. Simply put, for each row and for each column, it learns (k) numeric “factors” that represent the row or column.

This would line up the user factor “does this user like comedies?” with the movie factor “is this a comedy?”, so we can predict that the user will like the movie when both answers are "yes". In the general formula:

user_i_rating_movie_j = sum over all factors k of (user_i_k * movie_j_k)

If the items were residential properties in the US, those k factors could be things like:
1. User i likes/has visited a lot of houses with a big backyard, and house j has a big backyard.
2. User i likes/has visited a lot of houses with a downstairs bedroom, and house j has a downstairs bedroom.

... It is impractical to always pre-assign meaning to these factors at large scale; rather, we need to learn the values directly from data. Once these factors are learned using the observational data, a simple dot product between the matrices can allow one to fill in all of the missing values. If two movies are similar to each other, they will likely have similar factors, and their corresponding matrix columns will have similar values. The upside to this approach is that an informative set of factors can be learned automatically from data, but the downside is that the meaning of these factors might not be easily interpretable.
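As a rough sketch of that idea (the factor matrices here are random stand-ins for learned ones, not the tutorial's code), once a user-factor matrix and a movie-factor matrix have been learned, every missing cell can be filled in with a dot product:

import numpy as np

k = 25                                     # number of latent factors
user_factors = np.random.randn(250, k)     # one k-dimensional row per user (learned in practice)
movie_factors = np.random.randn(250, k)    # one k-dimensional row per movie (learned in practice)

# Predicted rating of user i for movie j is the dot product of their factor vectors
i, j = 3, 7
prediction = user_factors[i].dot(movie_factors[j])

# The fully completed matrix is just the product of the two factor matrices
completed_matrix = user_factors.dot(movie_factors.T)
print(prediction, completed_matrix.shape)  # (250, 250)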

Matrix factorization is a linear method, meaning that if there are complicated non-linear interactions going on in the data set, a simple dot product may not be able to handle it well. Given the recent success of deep learning in complicated non-linear computer vision and natural language processing tasks, it is natural to want to find a way to incorporate it into matrix factorization as well.

A way to do this is called “deep matrix factorization” and involves the replacement of the dot product with a neural network that is trained jointly with the factors. This makes the model more powerful because a neural network can model important non-linear combinations of factors to make better predictions.

In traditional matrix factorization, the prediction is the simple dot product between the factors for each of the dimensions. In contrast, in deep matrix factorization the factors for both are concatenated together and used as the input to a neural network whose output is the prediction. The parameters in the neural network are then trained jointly with the factors to produce a sophisticated non-linear model for matrix factorization.

This tutorial will first introduce both traditional and deep matrix factorization by using them to model synthetic data.
%pylab inline
import mxnet as mx
import pandas
import seaborn; seaborn.set_style('whitegrid')
import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

X = numpy.random.randn(250, 250)
n = 35000
i = numpy.random.randint(250, size=n) # Generate random row indexes
j = numpy.random.randint(250, size=n) # Generate random column indexes
X_values = X[i, j] # Extract those values from the matrix
print(X_values.shape)

plt.title("Distribution of Data in Matrix", fontsize=16)
plt.ylabel("Count", fontsize=14)
plt.xlabel("Value", fontsize=14)
plt.hist(X_values, bins=100)
plt.show()
A core component of the matrix factorization model will be the embedding layer. This layer takes in an integer and outputs a dense array of learned features. This is widely used in natural language processing, where the input would be words and the output might be the features related to those words’ sentiment. The strength of this layer is the ability for the network to learn what useful features are instead of needing them to be predefined.
user = mx.symbol.Variable("user") # Name one of our input variables, the user index
# Define an embedding layer that takes in a user index and outputs a dense 25 dimensional vector
user = mx.symbol.Embedding(data=user, input_dim=250, output_dim=25) 

movie = mx.symbol.Variable("movie") # Name the other input variable, the movie index
# Define an embedding layer that takes in a movie index and outputs a dense 25 dimensional vector
movie = mx.symbol.Embedding(data=movie, input_dim=250, output_dim=25)

# Name the output variable. "softmax_label" is the default name for outputs in mxnet
y_true = mx.symbol.Variable("softmax_label")

# Define the dot product between the two variables, which is the elementwise multiplication and a sum
y_pred = mx.symbol.sum_axis(data=(user * movie), axis=1)
y_pred = mx.symbol.flatten(y_pred)

# The linear regression output defines the loss we want to use to train the network, mse in this case
y_pred = mx.symbol.LinearRegressionOutput(data=y_pred, label=y_true)
We just implemented vanilla matrix factorization! It is a fairly simple model when expressed this way. All we need to do is define an embedding layer for the users and one for the movies, then define the dot product between these two layers as the prediction. Keep in mind that while we’ve used MXNet to implement this model, it does not yet utilize a deep neural network.

Since we have multiple inputs (the movie and the user), the NDArrayIter object is the most convenient as it can handle arbitrary inputs and outputs through the use of dictionaries.
# Build a data iterator for training data using the first 25,000 samples
X_train = mx.io.NDArrayIter({'user': i[:25000], 'movie': j[:25000]}, label=X_values[:25000], batch_size=1000)

# Build a data iterator for evaluation data using the last 10,000 samples
X_eval = mx.io.NDArrayIter({'user': i[25000:], 'movie': j[25000:]}, label=X_values[25000:], batch_size=1000)
We aren’t quite done yet. We’ve only defined the symbolic architecture of the network. We now need to specify that the architecture is a model to be trained using the Module wrapper. You can use multiple GPUs by passing in a list such as [mx.gpu(0), mx.gpu(1)]. Given that the examples we are dealing with in this tutorial are fairly simple and don’t involve convolutional layers, it likely won’t be very beneficial to use multiple GPUs.
model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, eval_metric='rmse', eval_data=X_eval)
Let’s now create some synthetic data that has a structure that can be exploited instead of evenly distributed random values. We can do this by first randomly generating two lower-rank matrices, one for the rows and one for the columns, and taking the dot product between them.
a = numpy.random.normal(0, 1, size=(250, 25)) # Generate random numbers for the first skinny matrix
b = numpy.random.normal(0, 1, size=(25, 250)) # Generate random numbers for the second skinny matrix

X = a.dot(b) # Build our real data matrix from the dot product of the two skinny matrices

n = 35000
i = numpy.random.randint(250, size=n)
j = numpy.random.randint(250, size=n)
X_values = X[i, j]

X_train = mx.io.NDArrayIter({'user': i[:25000], 'movie': j[:25000]}, label=X_values[:25000], batch_size=100)
X_eval = mx.io.NDArrayIter({'user': i[25000:], 'movie': j[25000:]}, label=X_values[25000:], batch_size=100)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
#model.fit(X_train, num_epoch=5, eval_metric='mse', eval_data=X_eval)
model.fit(X_train, num_epoch=5, optimizer='adam', eval_metric='mse', eval_data=X_eval)
With the plain SGD optimizer (the commented-out fit call above), it doesn’t seem like we’re learning anything again: neither the training nor the validation MSE goes down epoch by epoch. The problem this time is that, like many matrix factorization algorithms, we are using stochastic gradient descent (SGD), for which it is tricky to set an appropriate learning rate. However, since we’ve implemented matrix factorization using Apache MXNet, we can easily use a different optimizer. Adam is a popular optimizer that can automatically tune the learning rate to get better results, and we can specify it by adding optimizer='adam' to the fit function, as in the second fit call above.

Let’s now take a look at what deep matrix factorization would look like. Essentially, we want to replace the dot product that turns the factors into a prediction with a neural network that takes in the factors as input and predicts the output. We can modify the model code fairly simply to achieve this.
user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=250, output_dim=25)

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=250, output_dim=25)

y_true = mx.symbol.Variable("softmax_label")

# No longer using a dot product
#y_pred = mx.symbol.sum_axis(data=(user * movie), axis=1)
#y_pred = mx.symbol.flatten(y_pred)
#y_pred = mx.symbol.LinearRegressionOutput(data=y_pred, label=y_true)

# Instead of taking the dot product we want to concatenate the inputs together
nn = mx.symbol.concat(user, movie)
nn = mx.symbol.flatten(nn)

# Now we pass both the movie and user factors into a one layer neural network
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)

# We use a ReLU activation function here, but one could use any type including PReLU or sigmoid
nn = mx.symbol.Activation(data=nn, act_type='relu')

# Now we define our output layer, a dense layer with a single neuron containing the prediction
nn = mx.symbol.FullyConnected(data=nn, num_hidden=1)

# We don't put an activation on the prediction here because it is the output of the model
y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), 
          eval_metric='mse', eval_data=X_eval) 
The only difference to our code is that we have to define a neural network. The inputs to this network are the concatenated factors from the embedding layer, created using mx.symbol.concat. Next, we flatten the result just to ensure the right shape using mx.symbol.flatten, and then use mx.symbol.FullyConnected and mx.symbol.Activation layers as necessary to define the network. The final fully connected layer must have a single node, as this is the final prediction that the network is making.

Now let’s move on to a real example using the MovieLens 20M data set. This data is comprised of movie ratings from the MovieLens site (https://movielens.org/), a site that will predict what other movies you will like after seeing you rate movies. The MovieLens 20M data set is a sampling of about 20 million ratings from about 138,000 users on about 27,000 movies. The ratings range from 0.5 to 5 stars in 0.5 star increments. (At least 6GB of RAM is suggested for running the cells.)
import os
import urllib.request
import zipfile

# If we don't have the data yet, download it from the appropriate source
if not os.path.exists('ml-20m.zip'):
    urllib.request.urlretrieve('http://files.grouplens.org/datasets/movielens/ml-20m.zip', 'ml-20m.zip')

# Now extract the data since we know we have it at this point
with zipfile.ZipFile("ml-20m.zip", "r") as f:
    f.extractall("./")

# Now load it up using a pandas dataframe
data = pandas.read_csv('./ml-20m/ratings.csv', sep=',', usecols=(0, 1, 2))
data.head()
 
Let’s next quickly look at the users and movies. Specifically, let’s look at the maximum and minimum ID values, and the number of unique users and movies.
print("user id min/max: ", data['userId'].min(), data['userId'].max())
print("# unique users: ", numpy.unique(data['userId']).shape[0])
print("")
print("movie id min/max: ", data['movieId'].min(), data['movieId'].max())
print("# unique movies: ", numpy.unique(data['movieId']).shape[0]) 
It looks like the max user ID is equal to the number of unique users, but this is not the case for the number of movies.

We can quickly estimate the sparsity of the MovieLens 20M data set using these numbers. If there are ~138K unique users and ~27K unique movies, then there are ~3.7 billion entries in the matrix. Since only ~20M of these are present, ~99.5% of the matrix is missing.
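A quick back-of-the-envelope check of those numbers (using the approximate counts above):

n_users, n_movies, n_ratings = 138_000, 27_000, 20_000_000
total_cells = n_users * n_movies            # ~3.7 billion possible (user, movie) pairs
print(total_cells)                          # 3726000000
print(1 - n_ratings / total_cells)          # ~0.9946, i.e. ~99.5% of the matrix is missing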

n = 19000000
batch_size = 25000                      # assumed batch size for the iterators below (not given in the notes)
# The embeddings below are indexed by raw ID, so size them by the maximum IDs printed above
n_users = int(data['userId'].max())
n_movies = int(data['movieId'].max())
data = data.sample(frac=1).reset_index(drop=True) # Shuffle the data in place row-wise

# Use the first 19M samples to train the model
train_users = data['userId'].values[:n] - 1 # Offset by 1 so that the IDs start at 0
train_movies = data['movieId'].values[:n] - 1 # Offset by 1 so that the IDs start at 0
train_ratings = data['rating'].values[:n]

# Use the remaining ~1M samples for validation of the model
valid_users = data['userId'].values[n:] - 1 # Offset by 1 so that the IDs start at 0
valid_movies = data['movieId'].values[n:] - 1 # Offset by 1 so that the IDs start at 0
valid_ratings = data['rating'].values[n:]

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, 
                            label=train_ratings, batch_size=batch_size)
X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, 
                           label=valid_ratings, batch_size=batch_size)

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25)

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25)

y_true = mx.symbol.Variable("softmax_label")
y_pred = mx.symbol.sum_axis(data=(user * movie), axis=1)
y_pred = mx.symbol.flatten(y_pred)
y_pred = mx.symbol.LinearRegressionOutput(data=y_pred, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),),
          eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))
 
It looks like we’re learning something on this data set! We can see that both the training and the validation RMSE decrease epoch by epoch. Each epoch takes significantly longer than with the synthetic data because we’re training on 19 million examples instead of 25,000.

Let’s now turn to deep matrix factorization. We’ll use similar code as before, concatenating the two embedding layers together and using the result as input to two fully connected layers. This network has the same embedding-layer sizes and optimization parameters as the plain matrix factorization model, meaning the only change is that a neural network makes the prediction instead of a dot product.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, 
                            label=train_ratings, batch_size=batch_size)
X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, 
                           label=valid_ratings, batch_size=batch_size)

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25)

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25)

y_true = mx.symbol.Variable("softmax_label")

nn = mx.symbol.concat(user, movie)
nn = mx.symbol.flatten(nn)

# Since we are using a two layer neural network here, we will create two FullyConnected layers
# with activation functions before the output layer
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=1)

y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),),
          eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))
 
It is important to keep in mind that we are training a neural network, so all of the advances in deep learning can be immediately applied to deep matrix factorization. A widely used advance is batch normalization, which essentially shrinks the range of values that the internal nodes in a network take on, ultimately speeding up training. We can use batch normalization layers in our network here as well, simply by adding these layers between the fully connected layers and the activation layers.

It seems that for this specific data set, batch normalization does cause the network to converge faster, but does not produce a significantly better model.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, 
                            label=train_ratings, batch_size=batch_size)
X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, 
                           label=valid_ratings, batch_size=batch_size)

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25)

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25)

y_true = mx.symbol.Variable("softmax_label")

nn = mx.symbol.concat(user, movie)
nn = mx.symbol.flatten(nn)
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.BatchNorm(data=nn) # First batch norm layer here, before the activation!
nn = mx.symbol.Activation(data=nn, act_type='relu') 
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.BatchNorm(data=nn) # Second batch norm layer here, before the activation!
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=1)

y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),),
          eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250)) 
An alternate approach to this model is to treat the problem as classification instead of regression. In the classification approach, the classes are the various ratings, and one sets up a 10-class problem, since there are 10 possible ratings that a film can get. The only modification needed to the data is to map the ratings to integers between 0 and 9 instead of 0.5 to 5 stars with 0.5-star spacing.
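The ratings*2 - 1 transformation used in the iterators below is a minimal sketch of that mapping: it sends the 0.5-to-5.0 star scale onto the integers 0 through 9.

import numpy as np

stars = np.arange(0.5, 5.5, 0.5)       # 0.5, 1.0, ..., 5.0
classes = (stars * 2 - 1).astype(int)  # 0, 1, ..., 9
print(dict(zip(stars, classes)))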

The network only needs to be modified to have 10 hidden units in the final layer, since it is now a 10-class problem, and to use the softmax output designed for classification problems instead of the linear regression output designed for regression problems. Lastly, since this is a classification problem, we want to use accuracy as the evaluation metric instead of RMSE.

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, 
                            label=train_ratings*2-1, batch_size=batch_size)
X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, 
                           label=valid_ratings*2-1, batch_size=batch_size)

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=25)

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25)

y_true = mx.symbol.Variable("softmax_label")

nn = mx.symbol.concat(user, movie)
nn = mx.symbol.flatten(nn)
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=10) # 10 hidden units because 10 classes

#y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)
# SoftmaxOutput instead of LinearRegressionOutput because this is a classification problem now
# and we want to use a classification loss function instead of a regression loss function
y_pred = mx.symbol.SoftmaxOutput(data=nn, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),),
          eval_metric='acc', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))
Structural regularization in deep matrix factorization
Another aspect of deep matrix factorization’s flexibility over vanilla matrix factorization is the output_dim size of the embedding layers. For vanilla matrix factorization, these values must be the same for both inputs because the prediction is the dot product between the two. However, if the factors serve as the input to a neural network, this restriction goes away. This is useful when one dimension is significantly larger than the other and thus requires training a massive number of factors. In the MovieLens case, there are significantly more users (~138K) than movies (~27K). By changing the number of user factors from 25 to 15, we can reduce the number of parameters by about 1.38 million while not losing any expressivity on the movie side. The only change is the value of output_dim in the user embedding layer.
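That 1.38 million figure follows directly from the embedding sizes; a quick sanity check using the approximate user count:

n_users = 138_000               # approximate number of users
saved = n_users * (25 - 15)     # 10 fewer factors per user
print(saved)                    # 1,380,000 fewer embedding parameters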

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies}, 
                            label=train_ratings, batch_size=batch_size)
X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies}, 
                           label=valid_ratings, batch_size=batch_size)

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=15) 
# Using 15 instead of 25 here

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=25)

y_true = mx.symbol.Variable("softmax_label")

nn = mx.symbol.concat(user, movie)
nn = mx.symbol.flatten(nn)
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu') 
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=1)

y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),),
          eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))
It seems we are getting similar accuracy with drastically fewer parameters. This is useful in a setting where you may run out of memory due to the size of the matrix being completed, or as a way to reduce overfitting by using a simpler model.

Next, we can extend matrix factorization beyond using only two embedding layers. The MovieLens data set comes with genres for each of the films. For instance, for the movie "Waiting to Exhale (1995)", the corresponding genres are "Comedy|Drama|Romance".

For simplicity, let’s only use the first genre of the many that may be specified, and determine a unique ID for each genre.
# The genres come from the movies metadata file; loading it is assumed to look like this
genres = pandas.read_csv('./ml-20m/movies.csv', usecols=(0, 2))
labels_str = [label.split("|")[0] for label in genres['genres']]
label_set = numpy.unique(labels_str)
label_idxs = {l: i for i, l in enumerate(label_set)}
label_idxs

labels = numpy.empty(n_movies)
for movieId, label in zip(genres['movieId'], labels_str):
    labels[movieId-1] = label_idxs[label]

train_genres = numpy.array([labels[int(j)] for j in train_movies])
valid_genres = numpy.array([labels[int(j)] for j in valid_movies])
train_genres[:10]

X_train = mx.io.NDArrayIter({'user': train_users, 'movie': train_movies, 'movie_genre': train_genres}, 
                            label=train_ratings, batch_size=batch_size)
X_eval = mx.io.NDArrayIter({'user': valid_users, 'movie': valid_movies, 'movie_genre': valid_genres}, 
                           label=valid_ratings, batch_size=batch_size)

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=n_users, output_dim=15)

movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=n_movies, output_dim=20) # Reduce from 25 to 20

# We need to add in a third embedding layer for genre
movie_genre = mx.symbol.Variable("movie_genre")
movie_genre = mx.symbol.Embedding(data=movie_genre, input_dim=20, output_dim=5) # Set to 5

y_true = mx.symbol.Variable("softmax_label")

nn = mx.symbol.concat(user, movie, movie_genre)
nn = mx.symbol.flatten(nn)
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=1)

y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)

model = mx.module.Module(context=mx.gpu(0), data_names=('user', 'movie', 'movie_genre'), symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),),
          eval_metric='rmse', eval_data=X_eval, batch_end_callback=mx.callback.Speedometer(batch_size, 250))
While it doesn’t appear that using only the first genre has led to much improvement on this data set, it demonstrates the types of things that one could do with the flexibility afforded by deep matrix factorization.

Monday, August 30, 2021

Data Science Study notes: Recommendation engine notes 2: collaborative filtering vs Content-Based (Knowledge-Based)

Most recommender systems leverage two types of data:
1. Interaction data, such as ratings and browsing behavior ==> collaborative filtering methods
2. Attribute information about each user and item ==> content-based (knowledge-based) methods

Hybrid systems are then used to combine the advantages of these approaches to build a system that performs robustly across a wide variety of applications.

Intuitively, Collaborative Filtering Methods use the collaborative power of the ratings provided by multiple users to make recommendations and rely mostly on leveraging either inter-item correlations or inter-user interactions for the prediction process. It relies on an underlying notion that two users who rate items similarly are likely to have comparable preferences for other items.

Two types of methods fall under collaborative filtering:
Memory-based methods, also referred to as neighborhood-based collaborative filtering algorithms, where ratings of user-item combinations are predicted based on their neighborhoods. These neighborhoods can be further defined as (1) user-based or (2) item-based.
Model-based methods, where a predictive model (such as matrix factorization) is learned from the rating data and then used to score unseen user-item combinations.
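To make the user-based flavor concrete, here is a minimal toy sketch of neighborhood-based prediction (the ratings matrix, the cosine similarity measure, and the neighborhood size are illustrative assumptions, not a specific library's method): find users with similar rating vectors and average their ratings for the target item.

import numpy as np

# Toy ratings matrix: rows are users, columns are items, 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict(user, item, k=2):
    # Cosine similarity between the target user and every other user
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(R[user])
    sims = R.dot(R[user]) / np.maximum(norms, 1e-9)
    sims[user] = -np.inf                       # exclude the user themselves
    neighbors = np.argsort(sims)[-k:]          # top-k most similar users
    rated = [n for n in neighbors if R[n, item] > 0]
    if not rated:
        return R[R[:, item] > 0, item].mean()  # fall back to the item mean
    weights = np.array([sims[n] for n in rated])
    return np.average([R[n, item] for n in rated], weights=weights)

print(predict(user=0, item=2))  # user 0 hasn't rated item 2; similar users fill it in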

Q: What is the basic principle that underlies the working of recommendation algorithms?
The basic principle of recommendations is that there are significant dependencies between user- and item-centric activity. For example, a user who is interested in hotels in New York City is more likely to be interested in other NYC hotels than in hotels in Boston.

A recommender system, or recommendation system, can be thought of as a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item, typically by optimizing for objectives like total clicks, total revenue, or overall sales. For instance: which other car should we recommend to this user so that they are most likely to submit a quote for it?

The primary business goal of any recommender system is to provide users with a personalized experience. This can be framed either as a prediction problem (predict ratings for other items) or as a ranking problem (produce the top-k items a user is most likely to enjoy).

LightFM is a Python implementation of several popular recommendation algorithms for both implicit and explicit feedback types. Importantly, it allows you to incorporate both item and user metadata into the traditional matrix factorization algorithms, making it possible to generalize to new items (via item features) and new users (via user features).
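As a minimal sketch of LightFM usage (the bundled MovieLens 100k fetcher and the WARP loss are convenient illustrative choices, not the only options):

from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

data = fetch_movielens(min_rating=4.0)          # small MovieLens sample bundled with LightFM
model = LightFM(no_components=30, loss='warp')  # WARP loss optimizes ranking for implicit feedback
model.fit(data['train'], epochs=10, num_threads=2)
print(precision_at_k(model, data['test'], k=5).mean())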

Here are some details for each of those 3 AWS approaches:
1. AWS Neptune uses the graph/relationship approach to get recommendations.
2. AWS EMR uses the ALS algorithm to get recommendations (matrix factorization in Spark MLlib).
3. SageMaker is used for ML/DL algorithms to get recommendations (deep matrix factorization, MXNet).


Saturday, August 28, 2021

AWS Study notes: AWS Glue data

AWS Glue usage:
1. To build a data warehouse to organize, cleanse, validate, and format data.
2. When you run serverless queries against your Amazon S3 data lake.
3. When you want to create event-driven ETL (extract, transform, load) pipelines.
4. To understand your datasets.
#Step 1: Crawlers: a crawler connects to a data store, processes it through a prioritized list of classifiers to determine the schema of your data, and then creates metadata in your Data Catalog. For example, you can create a new crawler that crawls the s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv file and places the resulting metadata into a database named payments in the AWS Glue Data Catalog.
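Crawlers can be created in the console or, as a hedged sketch, via boto3 (the crawler name and IAM role below are placeholders, not values from these notes):

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Hypothetical crawler over the Medicare sample data; role and names are placeholders
glue.create_crawler(
    Name='medicare-crawler',
    Role='AWSGlueServiceRole-Default',
    DatabaseName='payments',
    Targets={'S3Targets': [{'Path': 's3://awsglue-datasets/examples/medicare/'}]},
)
glue.start_crawler(Name='medicare-crawler')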


#Step 2: Add Boilerplate Script to the Development Endpoint Notebook 
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # Spark session used for the plain DataFrame reads in Step 3

#Step 3: Compare Different Schema Parsings 
medicare = spark.read.format(
   "com.databricks.spark.csv").option(
   "header", "true").option(
   "inferSchema", "true").load(
   's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()

medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
       database = "payments",
       table_name = "medicare")
medicare_dynamicframe.printSchema()

medicare_res = medicare_dynamicframe.resolveChoice(specs = [('provider id','cast:long')])
medicare_res.printSchema()

medicare_res.toDF().where("`provider id` is NULL").show()
medicare_dataframe = medicare_res.toDF()
medicare_dataframe = medicare_dataframe.where("`provider id` is NOT NULL")

#Step 4: Map the Data and Use Apache Spark Lambda Functions 
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

chop_f = udf(lambda x: x[1:], StringType())
medicare_dataframe = medicare_dataframe.withColumn(
        "ACC", chop_f(
            medicare_dataframe["average covered charges"])).withColumn(
                "ATP", chop_f(
                    medicare_dataframe["average total payments"])).withColumn(
                        "AMP", chop_f(
                            medicare_dataframe["average medicare payments"]))
medicare_dataframe.select(['ACC', 'ATP', 'AMP']).show()
from awsglue.dynamicframe import DynamicFrame
medicare_tmp_dyf = DynamicFrame.fromDF(medicare_dataframe, glueContext, "nested")
medicare_nest_dyf = medicare_tmp_dyf.apply_mapping([('drg definition', 'string', 'drg', 'string'),
                 ('provider id', 'long', 'provider.id', 'long'),
                 ('provider name', 'string', 'provider.name', 'string'),
                 ('provider city', 'string', 'provider.city', 'string'),
                 ('provider state', 'string', 'provider.state', 'string'),
                 ('provider zip code', 'long', 'provider.zip', 'long'),
                 ('hospital referral region description', 'string','rr', 'string'),
                 ('ACC', 'string', 'charges.covered', 'double'),
                 ('ATP', 'string', 'charges.total_pay', 'double'),
                 ('AMP', 'string', 'charges.medicare_pay', 'double')])
medicare_nest_dyf.printSchema()

#Step 5: Write the Data to Apache Parquet 
glueContext.write_dynamic_frame.from_options(
       frame = medicare_nest_dyf,
       connection_type = "s3",
       connection_options = {"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
       format = "parquet")
Data Catalog: Table definition, job definition.
Classifier: schema of the data.
Connection: JDBC connection.

Thursday, July 29, 2021

GCP Study Notes 8-a: Example of using Google dataflow & Apache Beam

In this tutorial, you'll learn the basics of the Cloud Dataflow service by running a simple example pipeline using the Apache Beam Python SDK. This pipeline will show you the basics of reading a text file from Google Cloud Storage, counting the number of unique words in the file, and finally writing the word counts back to Google Cloud Storage.

What is Dataflow, and how does it auto-scale?
To use Dataflow, turn on the Cloud Dataflow APIs and open the Cloud Shell.

Dataflow runs jobs written using the Apache Beam SDK. To submit jobs to the Dataflow Service using Python, your development environment will require Python, the Google Cloud SDK, and the Apache Beam SDK for Python. Additionally, Cloud Dataflow uses pip3, Python's package manager, to manage SDK dependencies, and virtualenv to create isolated Python environments.

1. Install virtualenv:
pip3 install --upgrade virtualenv --user

2. Create a Python virtual environment:
python3 -m virtualenv env

3. Activate it:
source env/bin/activate

In order to write a Python Dataflow job, you will first need to download the SDK from the repository. When you run this command, pip3 will download and install the appropriate version of the Apache Beam SDK.

4. Install the Apache Beam SDK:
pip3 install --quiet apache-beam[gcp]

5. To see the examples we are using:
pwd
~/env/lib/python3.7/site-packages

#list all the folders with beam:
ls |grep beam
pwd
/home/usename_id/env/lib/python3.7/site-packages/apache_beam/examples

Set up a Cloud Storage bucket. Cloud Dataflow uses Cloud Storage buckets to store output data and cache your pipeline code. In Cloud Shell, use the gsutil mb command to create a Cloud Storage bucket.

gsutil mb gs://symbol-idap-poc-de4d

In Cloud Dataflow, data processing work is represented by a pipeline. A pipeline reads input data, performs transformations on that data, and then produces output data. A pipeline's transformations might include filtering, grouping, comparing, or joining data.

Use Python to launch your pipeline on the Cloud Dataflow service. The running pipeline is referred to as a job:

python3 -m apache_beam.examples.wordcount \
  --project symbol-idap-poc-pid \
  --runner DataflowRunner \
  --temp_location gs://symbol-idap-poc-gcs-userid/temp \
  --output gs://symbol-idap-poc-gcs-userid/results/output \
  --job_name dataflow-intro \
  --region us-central1

python -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://symbol-idap-poc-gcs-username/ouput \
  --runner DataflowRunner \
  --project symbol-idap-poc-de4d \
  --region us-west1 \
  --job_name dataflow-intro \
  --service_account=tf-sybm-idap-poc@sybm-poc-pid.iam.gserviceaccount.com \
  --subnetwork=https://www.googleapis.com/compute/v1/projects/sybm-network-sbx-nid/regions/us-west1/subnetworks/sybm-app-us-w1-app-sbx-subnet \
  --temp_location gs://symbol-idap-poc-gcs-username/temp/

--runner is the specific execution engine to use to run your pipeline. The DataflowRunner uses the Dataflow Service as the execution engine.
--temp_location is the storage bucket Cloud Dataflow will use for the binaries and other data for running your pipeline. This location can be shared across multiple jobs.
--output is the bucket used by the WordCount example to store the job results.
--job_name is a user-given unique identifier. Only one job may execute with the same name.

You might get some error message similar to the following:
"I wrote a simple python program 'guessing_game.py'. When I try to run this code in the command prompt using "python -m guessing_game.py" the program runs fine but in the end it says Error while finding module specification for 'guessing_game.py' (ModuleNotFoundError: path attribute not found on 'guessing_game' while trying to find 'guessing_game.py'). And when I run the same program using "python -guessing_game.py" it runs fine and it doesn't show that message as well."

The solution: change test.py to test. The -m flag takes a module name, not a file path, and .py isn't part of the module name, so either python test.py or python -m test will work. Since the argument to -m is a module name, you must not give a file extension (.py), and the module name should be a valid Python module name.

python -m test \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://sybm-idap-poc-gcs-username/ouput \
  --runner DataflowRunner \
  --project sybm-idap-poc-decd \
  --region us-west1 \
  --service_account=tf-clgx-idap-poc@clgx-idap-poc-de4d.iam.gserviceaccount.com \
  --subnetwork=https://www.googleapis.com/compute/v1/projects/sybm-network-sbx-77c3/regions/us-west1/subnetworks/sybm-app-us-w1-app-sbx-subnet \
  --temp_location gs://sybm-idap-poc-gcs-username/temp/

Use the following gcloud command-line tool command to view the dataflow service account:
gcloud iam roles describe roles/dataflow.serviceAgent

Controller service account: Pay attention to Terraform Service Accounts:
service_account=tf-nameid-poc@nameid-poc-de4d.iam.gserviceaccount.com
Terraform Service Accounts Module: This module allows easy creation of one or more service accounts, and granting them basic roles.

Compute Engine instances execute Apache Beam SDK operations in the cloud. These workers use your project’s controller service account to access your pipeline’s files and other resources. Dataflow also uses the controller service account to perform “metadata” operations, which don’t run on your local client or on Compute Engine workers. These operations perform tasks such as determining input sizes and accessing Cloud Storage files.

For the controller service account to be able to create, run, and examine a job, ensure that it has the roles/dataflow.admin and roles/dataflow.worker roles. In addition, the iam.serviceAccounts.actAs permission is required for your user account in order to impersonate the service account.

Here is the example code for the program test.py:

import apache_beam as beam
import argparse
import logging
import datetime, os
from apache_beam.options.pipeline_options import PipelineOptions
from datetime import datetime

# change these to try this notebook out
BUCKET = 'symb-idap-poc-gcs-bid'
PROJECT = 'symb-idap-poc-de4d'
REGION = 'us-west1'
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

def compute_fit1(row):
    from scipy import stats
    import numpy as np
    durations = row['duration_array']
    ag, bg, cg = stats.gamma.fit(durations)
    if np.isfinite(ag) and np.isfinite(bg) and np.isfinite(cg):
        result = {}
        result['cntycd'] = str(row['cntycd'])
        result['ag'] = ag
        result['bg'] = bg
        result['cg'] = cg
        yield result

def run_job():
    import shutil, os
    job_name = 'dataflow-test-' + datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    print("\n ******** start Time = ", datetime.now().strftime("%H:%M:%S"))
    print('\n Launching Dataflow job {} ... hang on'.format(job_name))
    #parser = argparse.ArgumentParser()
    #args, beam_args = parser.parse_known_args()

    for i in range(11):
        globals()["query" + str(i + 1)] = " SELECT cntycd,count(*) as cnt, \
            ARRAY_AGG(livsqftnbr) AS duration_array FROM `dataflow_test.data_new4` \
            where cntycd is not null and livsqftnbr is not null \
            GROUP BY cntycd having cnt > 1000 "

    beam_options = PipelineOptions(
        runner='DataflowRunner',
        job_name=job_name,
        project=PROJECT,
        region=REGION
    )  # 'requirements_file': 'requirements.txt'

    with beam.Pipeline(options=beam_options) as p:
        (p
         | 'read_bq1' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query1, use_standard_sql=True))
         | 'compute_fit1' >> beam.FlatMap(compute_fit1)
         | 'write_bq1' >> beam.io.gcp.bigquery.WriteToBigQuery(
                'dataflow_test.out1',
                schema='cntycd:string,ag:FLOAT64,bg:FLOAT64,cg:FLOAT64',
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
        (p
         | 'read_bq2' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query2, use_standard_sql=True))
         | 'compute_fit2' >> beam.FlatMap(compute_fit1)
         | 'write_bq2' >> beam.io.gcp.bigquery.WriteToBigQuery(
                'dataflow_test.out2',
                schema='cntycd:string,ag:FLOAT64,bg:FLOAT64,cg:FLOAT64',
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

run_job()
print("\n ******** end Time = ", datetime.now().strftime("%H:%M:%S"))
print('Done')

Difference between the FlatMap transform and the Map transform:
These transforms in Beam are essentially the same as in Spark (Scala too).
A Map transform maps a PCollection of N elements into another PCollection of N elements.
A FlatMap transform maps a PCollection of N elements into N collections of zero or more elements, which are then flattened into a single PCollection.

As a simple example, the following happens:

beam.Create([1, 2, 3]) | beam.Map(lambda x: [x, 'any'])
# The result is a collection of THREE lists: [[1, 'any'], [2, 'any'], [3, 'any']]

# Whereas:
beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: [x, 'any'])
# The lists that are output by the lambda are then flattened into a
# collection of SIX single elements: [1, 'any', 2, 'any', 3, 'any']

# FlatMap requires that the function passed to it return a list;
# beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: x) will raise an error.
# As always with a PCollection, the order is arbitrary,
# so it could be [1, 2, 3, 'any', 'any', 'any'].

# A Map transformation is a "one to one" mapping on each element of a list/collection:
# {"Amar", "Akabar", "Anthony"} -> {"Mr.Amar", "Mr.Akabar", "Mr.Anthony"}

# A FlatMap transformation is usually applied to a collection like a "list of lists";
# this collection gets flattened to a single list and the transformation/mapping
# is applied to each element of the "list of lists"/collection:
# { {"Amar", "Akabar"}, "Anthony"} -> {"Mr.Amar", "Mr.Akabar", "Mr.Anthony"}

What's the Difference Between Batch and Streaming Processing?
In simple words: Stream="Continuous data"/"Real time data", Batch='Window of data'.

A batch is a collection of data points that have been grouped together within a specific time interval. Another term often used for this is a window of data. Streaming processing deals with continuous data and is key to turning big data into fast data.

Batch processing is most often used when dealing with very large amounts of data, and/or when data sources are legacy systems that are not capable of delivering data in streams.

Use cases for batch processing: payroll, billing, orders from customers.
Use cases for stream processing: fraud detection, log monitoring, social media sentiment analysis.

ParDo is useful for a variety of common data processing operations, including filtering a dataset: you can use ParDo to consider each element in a PCollection and either output that element to a new collection or discard it.
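As a hedged sketch of that filtering use case (the word-length threshold and the input values are made up), a DoFn either emits an element or drops it:

import apache_beam as beam

class KeepLongWords(beam.DoFn):
    def process(self, element):
        # Emit the element only if it passes the filter; otherwise output nothing
        if len(element) > 3:
            yield element

with beam.Pipeline() as p:
    (p
     | beam.Create(['to', 'be', 'or', 'not', 'dataflow'])
     | beam.ParDo(KeepLongWords())
     | beam.Map(print))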

The fundamental piece of every Beam program is the Pipeline: it contains the entire data processing task, from I/O to data transforms. An illustration of a 4-step pipeline is shown by the code snippet below; the three key elements, I/O transforms, PCollections, and PTransforms, are all wrapped inside the Pipeline.
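Here is a minimal sketch of such a 4-step pipeline, using local text files as stand-ins for the Cloud Storage paths used above: step 1 reads input into a PCollection, steps 2 and 3 apply PTransforms, and step 4 writes the output.

import apache_beam as beam

with beam.Pipeline() as p:  # the Pipeline wraps everything
    (p
     | 'read' >> beam.io.ReadFromText('input.txt')          # 1. I/O transform producing a PCollection
     | 'split' >> beam.FlatMap(lambda line: line.split())   # 2. PTransform: lines -> words
     | 'count' >> beam.combiners.Count.PerElement()         # 3. PTransform: word -> (word, count)
     | 'write' >> beam.io.WriteToText('counts'))            # 4. I/O transform writing the results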

How do we use pandas in Apache Beam?
pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.

It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.
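As a hedged sketch of that per-element usage (the records and the aggregation here are made up), a DoFn can build a small pandas DataFrame for each element it receives:

import apache_beam as beam
import pandas as pd

class SummarizeGroup(beam.DoFn):
    def process(self, element):
        # element is assumed to be (key, list_of_dicts); pandas is used per element
        key, records = element
        df = pd.DataFrame(records)
        yield {'key': key, 'mean_value': float(df['value'].mean())}

with beam.Pipeline() as p:
    (p
     | beam.Create([('a', [{'value': 1}, {'value': 3}]),
                    ('b', [{'value': 10}])])
     | beam.ParDo(SummarizeGroup())
     | beam.Map(print))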

Monday, July 26, 2021

Data Science Study Notes: Pandas user-defined function(UDF)

3 Methods for Parallelization in Spark
Native Spark: if you’re using Spark data frames and libraries (e.g. MLlib), then your code will be parallelized and distributed natively by Spark.

Thread Pools: The multiprocessing library can be used to run concurrent Python threads, and even perform operations with Spark data frames. Using thread pools this way is dangerous, because all of the threads will execute on the driver node. If possible it’s best to use Spark data frames when working with thread pools, because then the operations will be distributed across the worker nodes in the cluster.

The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.

Using subprocess, multiprocessing, or multiple threads in the PySpark environment can easily cause errors such as "function does not exist on the JVM": code that runs perfectly fine in a single process may fail when run in parallel across multiple processes.

# spark version
from pyspark.ml.regression import RandomForestRegressor
from multiprocessing.pool import ThreadPool

# Assumed setup (not shown in the original notes): boston_train and boston_test are
# Spark DataFrames with a "target" column, and we fan out over a small thread pool
pool = ThreadPool(5)            # assumed pool size
parameters = [10, 20, 50]       # assumed list of tree counts to try

# define a function to train a RF model and return metrics 
def mllib_random_forest(trees, boston_train, boston_test):

    # train a random forest regressor with the specified number of trees
    rf = RandomForestRegressor(numTrees = trees, labelCol="target")
    model = rf.fit(boston_train)

    # make predictions
    boston_pred = model.transform(boston_test)
    r = boston_pred.stat.corr("prediction", "target")

    # return the number of trees, and the R value 
    return [trees, r**2]
  
# run the tasks 
pool.map(lambda trees: mllib_random_forest(trees, boston_train, boston_test), parameters)
  
Powerful Pandas UDFs: A new feature in Spark that enables parallelized processing on Pandas data frames within a Spark environment.

Note: Pandas UDF is not the same as Python UDF. Pandas UDF has much better performance.
A pandas user-defined function (UDF)—also known as vectorized UDF—is a user-defined function that uses Apache Arrow to transfer data and pandas to work with the data. pandas UDFs allow vectorized operations that can increase performance up to 100x compared to row-at-a-time Python UDFs.

Two types of Pandas UDF: Grouped Map Pandas UDF vs Scalar Pandas UDFs

To define a scalar Pandas UDF, simply use @pandas_udf to annotate a Python function that takes in a pandas.Series as an argument and returns another pandas.Series of the same size. Below we illustrate this with the Plus One example.

#Using row-at-a-time UDFs, very slow: 

from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf  
@udf('double')
# Input/output are both a single double value
def plus_one(v):
      return v + 1

df.withColumn('v2', plus_one(df.v))


#Using Pandas UDFs, much faster:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Use pandas_udf to define a Pandas UDF
@pandas_udf('double', PandasUDFType.SCALAR)
# Input/output are both a pandas.Series of doubles

def pandas_plus_one(v):
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))
The examples above define a row-at-a-time UDF “plus_one” and a scalar Pandas UDF “pandas_plus_one” that performs the same “plus one” computation. The UDF definitions are the same except the function decorators: “udf” vs “pandas_udf”.

Grouped Map Pandas UDFs
Python users are fairly familiar with the split-apply-combine pattern in data analysis. Grouped map Pandas UDFs are designed for this scenario, and they operate on all the data for some group, e.g., “for each date, apply this operation”.

Grouped map Pandas UDFs first splits a Spark DataFrame into groups based on the conditions specified in the groupby operator, applies a user-defined function (pandas.DataFrame -> pandas.DataFrame) to each group, combines and returns the results as a new Spark DataFrame.

Subtract Mean: This example shows a simple use of grouped map Pandas UDFs: subtracting mean from each value in the group.


@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby('id').apply(subtract_mean)

In this example, we subtract the mean of v from each value of v for each group. The grouping semantics are defined by the “groupby” function, i.e., each input pandas.DataFrame to the user-defined function has the same “id” value. The input and output schema of this user-defined function are the same, so we pass “df.schema” to the pandas_udf decorator to specify the schema.

Grouped map Pandas UDFs can also be called as standalone Python functions on the driver. 
This is very useful for debugging, for example:

sample = df.filter(df.id == 1).toPandas()
# Run as a standalone function on a pandas.DataFrame and verify result
subtract_mean.func(sample)

# Now run with Spark
df.groupby('id').apply(subtract_mean)
In the example above, we first convert a small subset of Spark DataFrame to a pandas.DataFrame, and then run subtract_mean as a standalone Python function on it. After verifying the function logics, we can call the UDF with Spark over the entire dataset.

Ordinary Least Squares Linear Regression
The last example shows how to run OLS linear regression for each group using statsmodels. For each group, we calculate beta b = (b1, b2) for X = (x1, x2) according to statistical model Y = bX + c.


import statsmodels.api as sm
import pandas as pd  # needed for the pandas.DataFrame returned by the UDF below
# df has four columns: id, y, x1, x2

group_column = 'id'
y_column = 'y'
x_columns = ['x1', 'x2']
schema = df.select(group_column, *x_columns).schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def ols(pdf):
    group_key = pdf[group_column].iloc[0]
    y = pdf[y_column]
    X = pdf[x_columns]
    X = sm.add_constant(X)
    model = sm.OLS(y, X).fit()

    return pd.DataFrame([[group_key] + [model.params[i] for i in x_columns]],
                        columns=[group_column] + x_columns)

beta = df.groupby(group_column).apply(ols)
This example demonstrates that grouped map Pandas UDFs can be used with any arbitrary Python function of type pandas.DataFrame -> pandas.DataFrame. The returned pandas.DataFrame can have a different number of rows and columns than the input.

Sunday, June 6, 2021

Python Study notes: how do we use Underscore(_) in Python

You will find five main uses of underscore(_), listed below. Once you have an idea of them, you can use underscore(_) for these purposes yourself.

  1. Use In Interpreter
  2. Ignoring Values
  3. Use In Looping
  4. Separating Digits Of Numbers
  5. Naming

    5.1. Single Pre Underscore

    5.2. Single Post Underscore

    5.3. Double Pre Underscores

    5.4. Double Pre And Post Underscores

Let's look at all the uses briefly with examples.

1. Use In Interpreter

Python automatically stores the value of the last expression evaluated in the interactive interpreter in a particular variable called "_". You can also assign this value to another variable if you want, and you can use "_" like a normal variable. See the example:
>>> 5 + 4
9
>>> _     # stores the result of the above expression
9
>>> _ + 6
15
>>> _
15
>>> a = _  # assigning the value of _ to another variable
>>> a
15

2. Ignoring Values

Underscore(_) is also used to ignore the values. If you don't want to use specific values while unpacking, just assign that value to underscore(_).

Ignoring means assigning the value to the special variable underscore(_). We assign a value to underscore(_) when we are not going to use it in later code.

See the example

## ignoring a value
a, _, b = (1, 2, 3) # a = 1, b = 3
print(a, b)

## ignoring multiple values
## *(variable) used to assign multiple value to a variable as list while unpacking
## it's called "Extended Unpacking", only available in Python 3.x
a, *_, b = (7, 6, 5, 4, 3, 2, 1)
print(a, b)

3. Use In Looping

You can use underscore(_) as a variable in looping. See the examples below to get an idea.

## looping five times using _
for _ in range(5):
    print(_)
##output: 0,1,2,3,4

## iterating over a list using _
## you can use _ same as a variable
languages = ["Python", "JS", "PHP", "Java"]
for _ in languages:
    print(_)

_ = 5
while _ < 10:
    print(_, end = ' ') # default value of 'end' is '\n' in Python; we're changing it to a space
    _ += 1

4. Separating Digits Of Numbers

If you have a number with many digits, you can separate groups of digits as you like for better readability.

Ex:- million = 1_000_000

Next, you can also use underscore(_) to separate the binary, octal or hex parts of numbers.

Ex:- binary = 0b_0010, octa = 0o_64, hexa = 0x_23_ab

Execute all the above examples to see the results.

## different number systems
## you can also check whether they are correct or not by converting them into integers using the "int" method
million = 1_000_000
binary = 0b_0010
octa = 0o_64
hexa = 0x_23_ab

print(million)  ##output: 1000000
print(binary)  ##output:2
print(octa)  ##output: 52
print(hexa) ##output 9131

5. Naming Using Underscore(_)

Underscore(_) can be used to name variables, functions, classes, etc.:

  • Single Pre Underscore:- _variable
  • Single Post Underscore:- variable_
  • Double Pre Underscores:- __variable
  • Double Pre and Post Underscores:- __variable__

5.1. _single_pre_underscore

_name

A single pre underscore marks a name as intended for internal use; by convention, most of us don't access such names from outside.

See the following example.

class Test:

    def __init__(self):
        self.name = "datacamp"
        self._num = 7

obj = Test()
print(obj.name) ##output: datacamp
print(obj._num)  ##output: 7

A single pre underscore doesn't stop you from accessing the variable.

But a single pre underscore does affect the names that are imported from a module.

Let's write the following code in the my_functions.py file.

## filename:- my_functions.py

def func():
    return "datacamp"

def _private_func():
    return 7

Now, if you import all the methods and names from my_functions.py, Python doesn't import the names which start with a single pre underscore.

>>> from my_functions import *
>>> func()
'datacamp'
>>> _private_func()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_private_func' is not defined
You can avoid the above error by importing the module normally:
>>> import my_functions
>>> my_functions.func()
'datacamp'
>>> my_functions._private_func()
7

Single pre underscore is only meant for internal use.

5.2 single_postunderscore

name_

Sometimes, if you want to use a Python keyword as a variable, function, or class name, you can use this convention.

You can avoid conflicts with the Python Keywords by adding an underscore at the end of the name which you want to use.

Let's see the example.

>>> def function(class):
  File "<stdin>", line 1
    def function(class):
                 ^
SyntaxError: invalid syntax
>>> def function(class_):
...     pass
...
>>>

A single post underscore is used when you want to name a variable after a Python keyword: you avoid the clash by adding an underscore at the end of the name.

5.3. Double Pre Underscore

__name

Double pre underscores are used for name mangling.

Double pre underscores tell the Python interpreter to rewrite the attribute name in order to avoid naming conflicts in subclasses.

  • Name Mangling:- the Python interpreter alters the variable name in a way that makes it hard to clash with when the class is inherited (see the short sketch below).
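A minimal sketch of what the interpreter actually does with a double pre underscore attribute:

class Test:
    def __init__(self):
        self.__mangled = 7   # stored as _Test__mangled

obj = Test()
print(obj._Test__mangled)    # 7 — the mangled name still works
# print(obj.__mangled)       # would raise AttributeError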

Beyond that, name mangling gets more complex and is rarely worth relying on directly.

5.4. Double Pre And Post Underscores

__name__

In Python, you will find different names which start and end with double underscores. They are called magic methods or dunder methods.

class Sample():

    def __init__(self):
        self.__num__ = 7

obj = Sample()
obj.__num__

This can lead to clashes if you use such names for your own variables, so it's better to stay away from them.

Saturday, May 1, 2021

NLP study notes:

Word embedding: a collective term for models that learn to map a set of words or phrases in a vocabulary to vectors of numerical values.
Neural networks are designed to learn from numerical data.
Word embedding is really all about improving the ability of networks to learn from text data, by representing that data as lower-dimensional vectors. These vectors are called embeddings.

This technique is used to reduce the dimensionality of text data but these models can also learn some interesting traits about words in a vocabulary.

The general approach for dealing with words in your text data is to one-hot encode your text. You will have tens of thousands of unique words in your vocabulary, and computations with such one-hot encoded vectors are very inefficient because most values in a one-hot vector are 0. So the matrix multiplication between a one-hot vector and the first hidden layer will produce an output that is mostly 0 values.

We use embeddings to solve this problem and greatly improve the efficiency of our network. Embeddings are just like a fully-connected layer. We will call this layer the embedding layer and its weights the embedding weights.

So, we use this weight matrix as a lookup table. We encode the words as integers; for example, 'cool' is encoded as 512 and 'hot' as 764. Then, to get the hidden-layer output for 'cool', we simply look up the 512th row of the weight matrix. This process is called embedding lookup. The number of dimensions in the hidden-layer output is the embedding dimension.
To reiterate:
a) The embedding layer is just a hidden layer
b) The lookup table is just an embedding weight matrix
c) The lookup is just a shortcut for matrix multiplication
d) The lookup table is trained just like any weight matrix
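A minimal numpy sketch of the lookup-table view (the vocabulary size, embedding dimension, and word indices are made-up examples):

import numpy as np

vocab_size, embedding_dim = 10_000, 300
embedding_weights = np.random.randn(vocab_size, embedding_dim)  # learned during training in practice

word_to_index = {'cool': 512, 'hot': 764}               # integer encoding of the vocabulary
cool_vector = embedding_weights[word_to_index['cool']]  # "lookup" = selecting row 512

# Equivalent to multiplying a one-hot vector by the weight matrix, but far cheaper
one_hot = np.zeros(vocab_size)
one_hot[word_to_index['cool']] = 1
assert np.allclose(one_hot.dot(embedding_weights), cool_vector)
print(cool_vector.shape)   # (300,) — the embedding dimension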

Popular off-the-shelf word embedding models in use today:
Word2Vec (by Google)
GloVe (by Stanford)
fastText (by Facebook)

Word2Vec:
This model is provided by Google and is trained on Google News data. It has 300 dimensions and covers 3 million words and phrases from the Google News data. The team used skip-gram with negative sampling to build this model. It was released in 2013.

GloVe:
Global Vectors for word representation (GloVe) is provided by Stanford. They provide various models with 25, 50, 100, 200, and 300 dimensions, trained on 2, 6, 42, and 840 billion tokens. The team used word-to-word co-occurrence to build this model. In other words, if two words co-occur many times, they likely have some linguistic or semantic similarity.

fastText:
This model is developed by Facebook. They provide 3 models with 300 dimensions each. fastText is able to achieve good performance for word representations and sentence classification because it makes use of character-level representations. Each word is represented as a bag of character n-grams in addition to the word itself. For example, for the word partial, with n=3, the fastText representation of the character n-grams is <pa, par, art, rti, tia, ial, al>. The angle brackets < and > are added as boundary symbols to separate the n-grams from the word itself.
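A small sketch of how those character n-grams can be generated in plain Python (the < and > boundary symbols follow the fastText convention):

def char_ngrams(word, n=3):
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('partial'))
# ['<pa', 'par', 'art', 'rti', 'tia', 'ial', 'al>']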
