Friday, December 13, 2019

Python study notes 2: F-score, Adaptive boosting VS Gradient boosting VS XGBoost, softmax, KNN

What's F score? What's precision/recall?
Why is it called "Random" "Forest"?
Adaptive boosting VS Gradient boosting?
Gradient boosting VS Random Forest?
Adaboost VS Gradient boosting VS XGBoost
L1-norm vs L2-norm in machine learning
Gradient Descent VS Stochastic Gradient Descent
What's a deep neural network? Why is it called deep?
How do we overcome the local minima issue?
What's softmax? Where is the "soft" coming from?
What's KNN (k-nearest neighbor) vs K-means clustering?

Question: What's the difference between validation and testing dataset?
A validation dataset is a dataset of examples used to tune the hyperparameters of a classifier. In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation dataset in addition to the training and test datasets.

A validation dataset is a sample of data held back from training your model; it is used to estimate model skill while tuning the model's hyperparameters. The test dataset is also held back from training, but it is used only at the end, to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.

In scikit-learn, cross-validation scores can be computed with cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1), where n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
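
As a rough illustration of how the three datasets relate, the sketch below sets a test set aside first and then builds k validation folds from the remaining rows. It uses only the standard library; the helper name k_fold_indices and the 80/20 split are assumptions made for this example, not part of scikit-learn.

```python
import random

def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

random.seed(0)
rows = list(range(100))      # stand-in for 100 row ids
random.shuffle(rows)

test_rows = rows[:20]        # held back until the very end
dev_rows = rows[20:]         # used for training and validation

folds = list(k_fold_indices(len(dev_rows), k=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), 'training rows,', len(valid_idx), 'validation rows')
```

Hyperparameters would be tuned on the validation folds only; the test rows are scored once, after tuning is finished.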

Question: What's precision/recall? What's F score?
Answer: Precision ("quality") = True positive / All predicted positive = True positive / (True positive + False positive). Note that precision is not the true positive rate, and it is not 1 - false positive rate (that quantity is the specificity).
Recall ("sensitivity", "completeness/quantity") = True positive / All actual positive = True positive / (True positive + False negative). Recall is the true positive rate.
We can use the following graph to explain:
In a search engine case, when a search engine returns 30 pages, only 20 of which were relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".

In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results. F-score or F-measure is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score.
The F1 score is the harmonic mean of the precision and recall: F1 = 2*p*r / (p + r). For positive numbers that are not all equal, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between. Since the harmonic mean of a list of numbers tends strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate/reduce the impact of large outliers and aggravate/enlarge the impact of small ones.
Two other commonly used F measures are the F_2 measure, which weighs recall higher than precision (by placing more emphasis on false negatives), and the F_0.5 measure, which weighs recall lower than precision (by attenuating the influence of false negatives).

The F-measure was derived so that F_beta "measures the effectiveness of retrieval with respect to a user who attaches beta times as much importance to recall as precision". The general formula is F_beta = (1 + beta^2)*p*r / (beta^2*p + r). Note the beta in the formula is not the probability of type-II error we learned in statistics.
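
To make the formulas concrete, here is a small sketch that recomputes the search engine example (20 relevant pages returned, 10 irrelevant pages returned, 40 relevant pages missed). The function names are made up for this illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * p * r / (beta^2 * p + r)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = precision_recall(tp=20, fp=10, fn=40)
f1 = f_beta(p, r)               # harmonic mean of p and r
f2 = f_beta(p, r, beta=2)       # leans toward recall
f05 = f_beta(p, r, beta=0.5)    # leans toward precision

print(round(p, 3), round(r, 3))
print(round(f1, 3), round(f2, 3), round(f05, 3))
```

Because recall is the weaker of the two numbers in this example, the recall-heavy F_2 comes out lowest and the precision-heavy F_0.5 comes out highest.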

Question: What's the KS statistic? How do we understand it intuitively?
Kolmogorov-Smirnov (KS) statistic: a widely used metric for validating predictive models, especially in BFSI domains (Banking, Financial Services and Insurance). It compares two cumulative distributions (usually the cumulative true positive rate and the cumulative false positive rate) and takes the maximum difference between them. Intuitively, the larger the gap, the better the model's scores separate the two classes.

Here is some sample code to calculate the K-S statistic in Python. The x-axis is the false positive rate (incorrectly predicted positive / total actual negative), and the y-axis is the true positive rate (correctly predicted positive / total actual positive).
import numpy as np

# all0 and check2 are existing pandas DataFrames from the author's session
all0 = all0[0:0]
xs = [(x + 1) * 0.1 for x in range(9)]
xs = xs + [0.85, 0.95, 0.975, 0.99]
for var in ('match_livingarea', 'match_yearbuilt'):
    for pcnt1 in xs:
        print(pcnt1, 'for match : ', var)
        check2['bad_cut']=np.where(check2.score>pcnt1, '1', 
        check2a = (check2[(check2.bad_cut != 'missing') & (check2[var] != 'missing')]
                   .groupby([var, 'bad_cut'], as_index=False)
                   .agg({"puid_x": 'count'})
                   .sort_values(['puid_x'], ascending=False))
        check2b = check2a.groupby([var], as_index=False).agg({"puid_x": 'sum'})
        check2d1 = check2c.pivot_table(index=['bad_cut'], columns=var, values='pcnt')
        check2c.pivot_table(index=['bad_cut'], columns=var, values='puid_x_x')
# big trouble if use the following:
# it will return null due to the setup: inplace=True
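
Since the snippet above depends on dataframes that are not shown, here is a separate, self-contained sketch of the K-S idea on a toy list of scores. It is not the author's code: the function name ks_statistic and the toy data are assumptions, and it simply scans every score as a threshold and keeps the largest gap between true positive rate and false positive rate.

```python
def ks_statistic(scores, labels):
    """Return (ks, threshold): the largest |TPR - FPR| over all score thresholds."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_ks, best_thr = 0.0, None
    for thr in sorted(set(scores)):
        # Predict positive when the score is at or above the threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        gap = abs(tp / n_pos - fp / n_neg)
        if gap > best_ks:
            best_ks, best_thr = gap, thr
    return best_ks, best_thr

scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
ks, thr = ks_statistic(scores, labels)
print(ks, thr)
```

A KS of 0 would mean the scores do not separate the classes at all, and a KS of 1 would mean perfect separation at some threshold.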

Question: What's Bias-Variance tradeoff?  
“simpler models have high bias and low variance whereas more complex or sophisticated models have low bias and high variance”

“high bias leads to under-fitting and high variance leads to over-fitting”.

Low variance (high bias) algorithms tend to be less complex, with simple or rigid underlying structure. They train models that are consistent, but inaccurate on average. These include linear or parametric algorithms such as regression and naive Bayes. On the other hand, low bias (high variance) algorithms tend to be more complex, with flexible underlying structure. They train models that are accurate on average, but inconsistent. These include non-linear or non-parametric algorithms such as decision trees and nearest neighbors. This tradeoff in complexity is why there's a tradeoff in bias and variance - an algorithm cannot simultaneously be more complex and less complex.

Total Error = Bias^2 + Variance + Irreducible error (the irreducible error is random noise that no model can explain)

High Variance

Assume we only have 100 training examples. If we attempt to fit these few data points to a neural network with 10000+ parameters, even the slightest change in the input data is likely to lead to a completely different trained model. Thus, we have high variance.

It is tempting to say that in this case we automatically have low bias. This is false. Recall that the bias does not depend on data. In fact, we don’t know the complexity of the underlying ground truth function f and it may itself have one million parameters. In this case, we would also have high bias.

Question: Why is it called Random Forest? How do we explain random forest intuitively? Where does "forest" come from? Where does "random" come from?

1. Why do we call it a "Forest"? A random forest is a collection of many decision trees. Instead of relying on a single decision tree, you build many decision trees, say 100 of them. And you know what a collection of trees is called: a forest. So now you understand why it is called a forest.
The key to understanding random forests is to understand bootstrap sampling. There's little use compiling five or ten identical decision tree models. There needs to be some variation, and that's why bootstrap sampling draws on the same dataset but extracts a different variation of the data at each turn. Hence, in growing random forests, multiple varying copies of the training data are first run through each of the trees. The predictions from the individual trees are then combined by voting to produce the final prediction, or what is known as the "final class". A downside of using random forests, though, is that you sacrifice the visual simplicity and ease of interpretation that comes with a single decision tree and instead get a black-box technique.

2. Why is it called random then?
Each decision tree in the forest considers a random subset of features when forming questions and only has access to a random set of the training data points. Say our dataset has 1,000 rows and 30 columns. There are two levels of randomness in this algorithm:

At row level: Each of these decision trees gets a random sample of the training data (say 10%), i.e. each of these trees will be trained independently on 100 randomly chosen rows out of 1,000 rows of data. Because each tree is trained on a different random set of rows, the trees differ from each other in their predictions.

At column level: The second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of columns to be sent to each tree. This means 3 randomly selected columns will be sent to each tree. So for the first decision tree, maybe columns C1, C2 and C4 were chosen. The next tree will have C4, C5, C10 as chosen columns, and so on. (In the standard random forest algorithm, the random feature subset is redrawn at every split rather than once per tree.)

Let us draw an analogy to explain: an interview selection process resembles a random forest algorithm. Each panel in the interview process is a decision tree. Each panel gives a result on whether the candidate passes or fails, and the majority of these results is declared as final. Say there were 5 panels, 3 said yes and 2 said no. The final verdict will be yes. Something similar happens in a random forest as well: the results from each of the trees are taken and the final result is declared accordingly.

Voting and averaging is used to predict in case of classification and regression respectively. When the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance.

The out-of-bag (oob) error estimate: In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.  
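
The "about one-third" figure can be checked with a quick simulation. The sketch below draws bootstrap samples (same size as the data, with replacement) and measures how many rows each sample leaves out; the row count and number of trees are arbitrary choices for illustration. In theory the left-out fraction approaches 1/e, roughly 0.368, for large datasets.

```python
import random

random.seed(42)
n_rows = 1000     # size of the training set
n_trees = 200     # number of bootstrap samples to draw

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: n_rows draws with replacement
    sample = [random.randrange(n_rows) for _ in range(n_rows)]
    in_bag = set(sample)
    oob_fractions.append(1 - len(in_bag) / n_rows)

mean_oob = sum(oob_fractions) / n_trees
print(round(mean_oob, 3))
```

Each tree can therefore be evaluated on the rows it never saw, which is what makes the oob error estimate possible without a separate test set.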

Question: What is AdaBoost? What's Gradient Boosting? What's the main difference between gradient boosting and random forest?

Why is it called boosting? Boosting is an ensemble technique in which new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.
AdaBoost is short for Adaptive Boosting. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers.

Gradient Boosting is basically about "boosting" many weak predictive models into a strong one, in the form of an ensemble of weak models. Here, a weak predictive model can be any model that works just a little better than a random guess. To build the strong model, we need to find a good way to "combine" the weak models.

In AdaBoost, arguably the most popular boosting algorithm, weak models are trained in an adaptive way (AdaBoost, and other boosting models, can be used for both classification and regression; a classification model is used here as an example):
1. Train a weak model using data samples drawn according to some weight distribution.

2. Increase the weight of samples that are misclassified by that model, and decrease the weight of samples that it classifies correctly. In other words, at each iteration, the training data is reweighted based on the results of the previous iteration: instances that were incorrectly predicted receive more weight, and instances that were correctly predicted receive less weight.

3. Train the next weak model using samples drawn according to the updated weight distribution.

In this way, the algorithm always trains models using data samples that were "difficult" to learn in previous rounds, which results in an ensemble of models that are good at learning different "parts" of the training data.
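
The reweighting in step 2 can be sketched for a single round as follows. This assumes the common discrete AdaBoost formulation with labels in {-1, +1} and a weighted error strictly between 0 and 1; the function name is made up for the example.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (new_weights, alpha) for labels in {-1, +1}."""
    # Weighted error of the current weak model (assumed to be in (0, 1))
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # alpha is the weak model's say in the final vote
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]      # the sample at index 1 is misclassified
weights = [0.2] * 5              # start from uniform weights

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights], round(alpha, 3))
```

After this round the single misclassified sample carries half of the total weight, so the next weak model is pushed to get it right.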

What's the differences between Adaptive boosting and Gradient boosting?

Both are boosting algorithms, which means that they convert a set of weak learners into a single strong learner. Both start from an initial strong learner and iteratively create a weak learner (usually a shallow decision tree) that is added to the strong learner.

They differ in how they create the weak learners during the iterative process. At each iteration, adaptive boosting changes the sample distribution by modifying the weights attached to each of the instances. It increases the weights of the wrongly predicted instances and decreases the weights of the correctly predicted instances.

The weak learner thus focuses more on the difficult instances. After being trained, the weak learner is added to the strong one according to its performance (the so-called alpha weight). The better it performs, the more it contributes to the strong learner.

On the other hand, gradient boosting doesn't modify the sample distribution. Instead of training on a new sample distribution, the weak learner trains on the remaining errors (so-called pseudo-residuals) of the strong learner. It is another way to give more importance to the difficult instances.

At each iteration, the pseudo-residuals are computed and a weak learner is fitted to these pseudo-residuals. Then, the contribution of the weak learner (the so-called multiplier) to the strong one isn't computed according to its performance on the new sample distribution but by using a gradient descent optimization process. The computed contribution is the one minimizing the overall error of the strong learner.
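
A minimal sketch of this loop, assuming squared-error loss (where the pseudo-residuals are simply y minus the current prediction), one-split regression stumps as weak learners, and a fixed learning rate in place of the optimized multiplier. The function names and toy data are invented for the illustration.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (least squares)."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x < split else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)              # initial strong learner: the mean
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Pseudo-residuals for squared error: what the ensemble still gets wrong
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = gradient_boost(xs, ys)

def mse(predict):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y)
boosted_mse = mse(model)
print(round(baseline_mse, 4), round(boosted_mse, 4))
```

Note that the sample weights never change here; each new stump only sees the errors left over by the ensemble so far.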

So what's the main difference between gradient boosting and random forest?
For random forest, we train multiple trees independently (so they can be built in parallel), each using a random subset of features and a random subset of records, and then combine them by voting or averaging. For gradient boosting, we train the decision trees sequentially: each new tree is fitted to the errors (pseudo-residuals) left by the trees built before it.

What's the hierarchy among Adaboost, Gradient boosting and XGBoost?

In terms of how advanced, fast and powerful they are, the usual ordering is XGBoost > Gradient Boosting > AdaBoost, which is the reason you hear people talking about and using XGBoost much more often than AdaBoost.

AdaBoost was the original implementation of boosting, with a single cost function and a difficulty in adapting to different link functions to create a linear model with a given outcome. Gradient boosting generalizes the framework and allows for easier computation. It can use multiple base learner types (trees, linear terms, splines...), and cost functions and link functions are modifiable.

XGBoost uses a few computational tricks that exploit a computer's hardware to speed up the gradient descent and line search components, as well as a penalty function (similar to elastic net penalties) to allow for robust, sparse modeling (which also speeds up the algorithm).

Question: What's an intuitive explanation of Gradient Descent? What's stochastic gradient descent?

Answer: Gradient means "slope" and descent means "to go down". So in the process of searching for the minimum value, gradient descent uses the slope of the curve to determine which direction to move. See the following example of 2 intuitive cases:
Case 1: The initial value of w is on the falling edge, i.e. point 1.
In this case, we need to find the direction in which the value goes down. Gradient descent will try to reach 4, i.e. the local minimum. Ideally we want to reach the global minimum, but since gradient descent always descends and never climbs, once at 4 it will never climb up to jump to 5. At the falling edge 1, the slope of the function E(w), i.e. dE/dw, is negative. Hence, to go down the slope, we have to increase the value of w in small steps. The size of these steps is controlled by the learning rate, commonly denoted as "alpha".

Case 2: The initial value of w is on the rising edge, i.e. point 10. At the rising edge 10, the slope of the function E(w), i.e. dE/dw, is positive. Hence, to go down the slope, we have to decrease the value of w in small steps.

Note that gradient descent is sensitive to the initial value of the parameters: if we choose 1 as our starting point for w, then we are more likely to end up at the local minimum 4. If we choose 10 as the starting point for w, then we are more likely to end up at the local minimum 7. Ideally, we should choose the starting value of w somewhere near 5. However, there is no way of knowing this in advance, and this remains a drawback of gradient descent.

It is also important to choose an appropriate value for the learning rate (alpha): if alpha is too high, we may end up bouncing between two points and may never reach the minimum. If alpha is too low, we will proceed towards the minimum very slowly, and the algorithm may take a very long time to converge.
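
The effect of the learning rate is easy to see on a toy problem. The sketch below is an illustration only: it minimizes E(w) = (w - 3)^2, whose slope is dE/dw = 2(w - 3), with three different learning rates. The function and the specific rates are assumptions chosen to show each behavior.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the slope: w <- w - learning_rate * dE/dw."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w

def grad(w):
    # Slope of E(w) = (w - 3)^2, which has its minimum at w = 3
    return 2 * (w - 3)

w_good = gradient_descent(grad, w0=0.0, learning_rate=0.1, n_steps=100)
w_slow = gradient_descent(grad, w0=0.0, learning_rate=0.001, n_steps=100)
w_bounce = gradient_descent(grad, w0=0.0, learning_rate=1.0, n_steps=100)

print(w_good, w_slow, w_bounce)
```

With a rate of 0.1 the search settles at the minimum, with 0.001 it has barely moved after 100 steps, and with 1.0 each step overshoots by exactly as much as it corrects, so w keeps jumping between 0 and 6.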

Question: XGBoost VS Catboost?
CatBoost is an open-source algorithm for gradient boosting on decision trees. It is developed by Yandex researchers and engineers, and is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction and many other tasks at Yandex.

Its main practical difference from XGBoost is that CatBoost lets you use non-numeric (categorical) factors directly, instead of having to pre-process your data or spend time and effort turning it into numbers.

Question: Batch/Natural Gradient Descent VS Stochastic Gradient Descent?

Why do we need Stochastic Gradient Descent? The problem with gradient descent is that when the data set is huge, calculating the parameter updates is expensive.

For example, if there are 1 billion sample points, it has to go through all 1 billion sample points on each iteration to calculate the update. In SGD, a small sample of the training set, or a single training example, is used to calculate the update on each iteration instead of the entire sample space. This is much faster.

For example, suppose you want to travel from A to B and you don't have GPS, but you can call 100 customer centers to check that you are heading in the right direction. There are two ways to get there. In the first, after you drive a distance x, you call all 100 call centers to see if you are going in the right direction, and you correct yourself until you reach B. This means you have to call all 100 customer centers every time you drive a distance x (which consumes a lot of time), so you make "n*100" calls. In the second, after you drive a distance x, you pick one call center at random and ask whether you are going in the right direction, until you reach B. This means you make only "n" calls.

Another analogy is how we judge people: when we first meet a person, we have little data on them, equivalent to the random data point we start the SGD algorithm with. As we gain more data about them, like their personality and food choices, our judgement of the individual gets more accurate and wholesome (hopefully). This way, as we get more data points, our mental model of the individual becomes more accurate.

Similarly, an SGD algorithm starts at a random point, updates the model parameters at each iteration using one data point at a time, and builds a model with progressively higher accuracy given a large data set. The algorithm is commonly trained on millions or billions of data points, which is what makes full-batch gradient descent computationally expensive.
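
The trade-off can be sketched on a toy linear regression. Everything here is an assumption made for illustration (noiseless data on the line y = 2x + 1, squared-error loss, hand-picked learning rate and step counts); the point is only to compare how many rows each approach has to touch.

```python
import random

random.seed(0)

# Toy data on the line y = 2x + 1 (noiseless, so the true parameters are known)
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(-50, 50)]

def batch_step(w, b, lr):
    """One gradient descent step that looks at every row."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * grad_w, b - lr * grad_b

def sgd_step(w, b, lr):
    """One stochastic step that looks at a single randomly chosen row."""
    x, y = random.choice(data)
    error = w * x + b - y
    return w - lr * 2 * error * x, b - lr * 2 * error

w_batch, b_batch = 0.0, 0.0
for _ in range(500):
    w_batch, b_batch = batch_step(w_batch, b_batch, lr=0.01)
rows_batch = 500 * len(data)     # every step reads all 100 rows

w_sgd, b_sgd = 0.0, 0.0
for _ in range(2000):
    w_sgd, b_sgd = sgd_step(w_sgd, b_sgd, lr=0.01)
rows_sgd = 2000                  # every step reads one row

print(round(w_batch, 3), round(b_batch, 3), rows_batch)
print(round(w_sgd, 3), round(b_sgd, 3), rows_sgd)
```

Both runs end up close to w = 2 and b = 1, but the stochastic version gets there while reading far fewer rows. On noisy real data the SGD path is more erratic than it is in this clean example.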

Question: What's the difference in L1-norm and L2-norm in machine learning?  

Answer: The difference between the L1-norm and the L2-norm depends on the application: as an error (loss) function, which is the most common situation, or as regularization.
Robustness: Intuitively speaking, since the L2-norm squares the error (increasing it by a lot if the error > 1), the model will see a much larger error (e^2 vs e) than with the L1-norm, so the model is much more sensitive to extreme values and adjusts itself to minimize this error. If the value is an outlier, the model will be adjusted to minimize this single outlier case at the expense of many other common examples, since the errors of these common examples are small compared to that single outlier case.

Therefore, the L1-norm (least absolute deviations) is more robust, in that it is resistant to outliers in the data.

Stability: The L1-norm solution is less stable than the L2-norm solution. For a small horizontal adjustment of a datum, the L1 regression line may jump a large amount.
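
The robustness point can be seen in the simplest possible model, a single constant fitted to a list of numbers. The constant that minimizes the squared (L2) error is the mean, and the constant that minimizes the absolute (L1) error is the median. The numbers below are made up for the illustration.

```python
import statistics

values = [10, 11, 9, 10, 12, 10, 11]
with_outlier = values + [100]          # add one extreme value

# Best constant under L2 error is the mean; under L1 error it is the median
l2_fit_clean = statistics.mean(values)
l1_fit_clean = statistics.median(values)
l2_fit_outlier = statistics.mean(with_outlier)
l1_fit_outlier = statistics.median(with_outlier)

l2_shift = abs(l2_fit_outlier - l2_fit_clean)
l1_shift = abs(l1_fit_outlier - l1_fit_clean)
print(round(l2_shift, 3), round(l1_shift, 3))
```

The single outlier drags the L2 fit by more than 11 units, while the L1 fit moves by only half a unit.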

Sparsity means that only very few entries in a matrix (or vector) are non-zero. The L1-norm has the property of producing many coefficients with zero or very small values and only a few large coefficients. Therefore, L1-norm/Lasso shrinks the coefficients of the less important features to zero, removing some features altogether. This works well for feature selection when we have a huge number of features.

Computational efficiency: The L1-norm does not have an analytical solution, but the L2-norm does. This allows L2-norm solutions to be calculated efficiently. However, L1-norm solutions do have the sparsity property, which allows them to be used with sparse algorithms, which makes the calculation more computationally efficient.

For the regularization purpose, LASSO (least absolute shrinkage and selection operator) regression is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the model.
From the figure, one can see that the constraint region defined by the L_1 norm is a square rotated so that its corners lie on the axes, while the region defined by the L_2 norm is a circle, which is rotationally invariant and, therefore, has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to encounter a corner of a hypercube, for which some components of beta are identically zero, while in the case of an n-sphere, the best solutions on the boundary are less likely to have some of the coefficients of beta equal to 0.

Question: What's a Python generator? How is it different from a regular function?
Answer: A Python generator is a fancy way of saying a function that returns a series of values. Officially, a Python generator is a function which returns a generator iterator (just an object we can iterate over) by calling yield, which is the key difference from a regular function, which usually ends with "return". The yield statement is only used when defining a generator function and is only used in the body of the generator function.

When we call a normal Python function, execution starts at function's first line and continues until a return statement, exception, or the end of the function (which is seen as an implicit return None) is encountered. Once a function returns control to its caller, that's it.  

Any work done by the function and stored in local variables is lost. A new call to the function creates everything from scratch. The python generator resumes execution from where it called yield, not from the beginning of the function.  

return sends a specified value back to its caller, whereas yield can produce a sequence of values. We should use yield when we want to iterate over a sequence but don’t want to store the entire sequence in memory.

In Python 2.4 and earlier, the yield statement was not allowed in the try clause of a try ... finally construct. The difficulty was that there's no guarantee the generator will ever be resumed, hence no guarantee that the finally block will ever get executed. Since Python 2.5 this restriction no longer applies.

When a generator is resumed, all of the state, like the values of local variables, is recovered and the generator continues to execute until the next call to yield. For large data sizes, yield mainly saves memory, because values are produced one at a time instead of being stored all at once. Here are some examples, ending with a yield generator that generates a series of smaller data slices from a big one:
def simpleGeneratorFun():
    yield 1
    yield 2
    yield 3

for value in simpleGeneratorFun():
    print(value)
# The output of this code will be:
# 1
# 2
# 3

def testyield():
    yield "Welcome to Python Tutorials"

output = testyield()
for i in output:
    print(i)
# Output:
# Welcome to Python Tutorials

import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True removes accent marks from the tokens
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# Call the generator function (data is an iterable of sentences loaded earlier):
data_words = list(sent_to_words(data))

def get_chunk_data(df, chunk_size=10000):
    # Yield consecutive row slices of df, chunk_size rows at a time
    for i in range(0, df.shape[0], chunk_size):
        yield df[i:i+chunk_size]

# Aug_wk0 is an existing pandas DataFrame
for chunk in get_chunk_data(Aug_wk0, chunk_size=100000):
    print('\n chunk', chunk.index[0], '-', chunk.index[-1])
    print('The first record: ', chunk.head(1))
    print('The last record: ', chunk.tail(1))
The generator yields one smaller dataframe at a time, so the chunks can be iterated over without building all of them in memory first.

What's a deep neural network? Why is it called deep? From the following graph explanation, you can see that a deep neural network is really a composition of several functions, g(f(k(l(m(n(x_1,x_2,...)))))). In other words, there are several hidden layers in the middle, which is the reason we call it a "deep" neural network.
Every time we need to use a neural network, we won't need to code the activation function, gradient descent, etc. There are lots of packages for this, which we recommend you check out, including the following: Keras, TensorFlow, Caffe, Theano, Scikit-learn.
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Building a Neural Network in Keras
Here are some core concepts you need to know for working with Keras:
Sequential Model
from keras.models import Sequential
# Create the Sequential model
model = Sequential()
The keras.models.Sequential class is a wrapper for the neural network model that treats the network as a sequence of layers. It implements the Keras model interface with common methods like compile(), fit(), and evaluate() that are used to train and run the model. We'll cover these functions soon, but first let's start looking at the layers of the model.
import numpy as np
from keras import initializers
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization, Dropout
# X has shape (num_rows, num_cols), where the training data are stored
# as row vectors
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
# y must have an output vector for each input vector
y = np.array([[0], [0], [0], [1]], dtype=np.float32)
# Create the Sequential model
model = Sequential()
# 1st Layer - Add an input layer of 128 nodes with the same input shape as
# the training samples in X
model.add(Dense(128, input_dim=X.shape[1]))
# Dropout is a technique where randomly selected neurons are ignored during training.
# They are 'dropped-out' randomly. This means that their contribution
# to the activation of downstream neurons
# is temporarily removed on the forward pass and any weight updates are not applied
# to the neuron on the backward pass.
model.add(BatchNormalization())
# Batch normalization layer: normalize the activations of the previous layer at each batch,
# i.e. apply a transformation that maintains the mean activation close to 0
# and the activation standard deviation close to 1.
model.add(Dense(64, kernel_initializer=initializers.RandomUniform(seed=123), activation='relu'))
# ReLU is the most commonly used activation function in neural networks,
# especially in CNNs. If you are unsure, ReLU is usually a good first choice.
model.add(BatchNormalization())
# Add a softmax activation layer
model.add(Dense(16, activation='softmax'))
# 2nd Layer - Add a fully connected output layer
model.add(Dense(1))
# Add a sigmoid activation layer
model.add(Activation('sigmoid'))
What's the difference between model.evaluate and model.predict? The model.evaluate function predicts the output for the given input, then computes the loss and the metrics specified in model.compile based on y_true and y_pred, and returns the computed values as the output. The model.predict function just returns y_pred.
score_test = model.evaluate(X_test, y_test, batch_size=128)
score_valid = model.evaluate(X_valid, y_valid, batch_size=128)
In the output, it will give you two scores: the first one, score_valid[0], is the loss, and the second one, score_valid[1], is the metric defined in model.compile. After we train the model, we can save it in 2 ways: save the entire model, or save the model weights and architecture separately.
# Suppose we have a model
from keras.applications import resnet50
model = resnet50.ResNet50(include_top=True, weights='imagenet')
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Import dependencies
import json
from keras.models import model_from_json, load_model

# Option 1: Save Weights + Architecture
model.save_weights('model_weights.h5')
with open('model_architecture.json', 'w') as f:
    f.write(model.to_json())

# Option 1: Load Weights + Architecture
with open('model_architecture.json', 'r') as f:
    new_model_1 = model_from_json(f.read())
new_model_1.load_weights('model_weights.h5')

# Option 2: Save/load the entire model
model.save('my_model.h5')
new_model_2 = load_model('my_model.h5')
Note: It is not recommended to use pickle or cPickle to save a Keras model. For scikit-learn models, however, you can use Python's pickle module or joblib, which is more efficient at serializing large NumPy arrays. Older scikit-learn versions exposed it as sklearn.externals.joblib, but that alias was removed in scikit-learn 0.23, so import joblib directly.
import joblib

# Saving a model (my_model is an already fitted scikit-learn estimator)
joblib.dump(my_model, 'my_model.pkl')

# Loading a model
loaded_model = joblib.load('my_model.pkl')

How do we overcome the local minima issue?

When we are trying to search for the global minimum/maximum, sometimes we might end up at a local minimum/maximum, depending on the starting point and the learning rate setup in the gradient descent. To overcome this issue, there are several solutions:

1) Random start: We can choose different random starting points for the training.

2) Learning rate decay: Instead of having a fixed (same) learning rate at every step, we can use a smaller and smaller learning rate as the training continues. A smaller learning rate is generally more stable than a large one, at the cost of slower progress.

3) SGD (Stochastic Gradient Descent): It uses the following parameters: a) Learning rate. b) Momentum (this takes a weighted average of the previous steps, = beta*previous_1st_step + beta^2*previous_2nd_step + beta^3*previous_3rd_step + ..., in order to get a bit of momentum and go over bumps, as a way to not get stuck in local minima). c) Nesterov Momentum (this slows down the gradient when it's close to the solution).

4) Adam (Adaptive Moment Estimation): uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.

5) RMSProp: RMSProp (RMS stands for Root Mean Square) decreases the learning rate by dividing it by the square root of an exponentially decaying average of squared gradients.
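
The momentum idea from item 3 can be sketched in a few lines. This is a simplified illustration, not how any particular library implements it: the velocity accumulates past slopes with weights beta, beta^2, and so on, and the test function E(w) = (w - 3)^2 and the parameter values are assumptions.

```python
def momentum_descent(grad, w0, learning_rate=0.1, beta=0.9, n_steps=300):
    """Gradient descent where each step also carries part of the previous steps."""
    w, velocity = w0, 0.0
    for _ in range(n_steps):
        # velocity = grad_now + beta * grad_prev + beta^2 * grad_prev2 + ...
        velocity = beta * velocity + grad(w)
        w = w - learning_rate * velocity
    return w

def grad(w):
    # Slope of E(w) = (w - 3)^2
    return 2 * (w - 3)

w_final = momentum_descent(grad, w0=0.0)
print(round(w_final, 4))
```

Because the velocity keeps some of the earlier direction, the update can roll through small bumps where plain gradient descent would stop. On this smooth bowl it overshoots and oscillates a little before settling near w = 3.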

What is the difference between the He normal and Xavier normal initializers in Keras? Both scale the initial weights based on the size of the layer so that the variance of the signal stays stable from layer to layer. Is there an intuitive explanation for the difference between them?

In summary, the main difference for machine learning practitioners is the following: He initialization works better for layers with ReLU activation. Xavier initialization works better for layers with sigmoid activation.

What's softmax? Where is the soft coming from? 
Say we have the neural network output for each class (our classes now being "good" and "bad"). We then pass these values through an operation called "softmax", which exponentiates each value and divides by the sum of the exponentials: softmax(x)_i = exp(x_i) / sum_j exp(x_j). The output of softmax is the probability of each class. For example, say that layer of the network outputs 2 for "good" and 4 for "bad". If we feed [2, 4] to softmax, it will return [0.12, 0.88] as the output, which says the network is 88% sure that the input is "bad" and our friend would not like that house.
Softmax takes an array and outputs an array of the same length. Notice that its outputs are all positive and sum up to 1, which is useful when we're outputting a probability value. Also notice that even though 4 is double 2, its probability is not merely double (as in 67% vs 33%), but about 7.4 times (e^2) that of 2. This is a useful property that exaggerates the difference in output, thus improving our training process. The "soft" comes from softmax being a smooth, differentiable version of the hard max (argmax): instead of giving 1 to the largest value and 0 to everything else, it gives the largest value most of the probability and leaves a little for the others.
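A small sketch reproduces the numbers for the [2, 4] example. The function is a plain-Python illustration, not the Keras implementation; subtracting the maximum is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2, 4])
print([round(p, 2) for p in probs])  # [0.12, 0.88]
```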
What's KNN(k-Nearest Neighbor) vs K-means clustering? 
KNN (k-Nearest Neighbor) is a supervised method that can be used for classification and regression. When we try to classify a data point, we take the k nearest labeled training points, see which class is the majority among those k points, and assign the test point to that class.
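The majority vote can be sketched as follows. The toy points, labels and function name are made up for illustration, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` with the majority class among its k nearest training points."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(train_points, train_labels, (2, 2)))  # a
print(knn_classify(train_points, train_labels, (7, 9)))  # b
```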
K-means clustering partitions n objects into k clusters, and is a form of unsupervised learning. Initially k data points are chosen randomly as the cluster centers (different initial points might lead to different clusters). Then we calculate the distance from each data point to those k centers, find the minimum distance, and assign the point to that center's cluster. After doing this for every data point, we recalculate the center of each cluster as the mean of its points, and repeat both steps until the assignments stop changing.
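The assign-then-recompute loop can be sketched in plain Python. This is a minimal illustration with hypothetical names and toy data, not an optimised implementation; squared Euclidean distance is assumed.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, assignments) after running basic k-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers taken from the data
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```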

How do we find out the best number of K?
There is a popular method known as the elbow method, which is used to determine the optimal value of K for the K-means clustering algorithm. The basic idea is to plot the within-cluster variance (cost) against the number of clusters k. There will be a value of K up to which the cost drops significantly, but beyond which adding one more cluster gives no further significant drop and the curve flattens out. That bend in the curve is the "elbow", and the K at the bend is the one to pick.
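To see the elbow in numbers, the sketch below runs a tiny one-dimensional k-means for several values of k and prints the cost. Everything here is an assumption for illustration: the data, the function name, and the deterministic initialisation at evenly spaced positions of the sorted data (used so that the output is repeatable).

```python
def kmeans_cost_1d(values, k, iterations=50):
    """Within-cluster sum of squared distances after a simple 1-D k-means."""
    data = sorted(values)
    # Deterministic initialisation: evenly spaced points of the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min((v - c) ** 2 for c in centers) for v in data)


# Three tight groups around 1, 5 and 9.
values = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]
costs = {k: kmeans_cost_1d(values, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 3))
```

The cost falls steeply up to k=3 and barely moves afterwards, so the elbow (and the suggested number of clusters) is at k=3.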

What's the difference between CNN(convolutional neural network) and MLP(multilayer perceptron) ? 

The main difference is that the convolutional neural network (CNN) has layers of convolution and pooling. The convolution layer convolves an area, or a patch of elements in the input data, into a smaller area to extract features. It is done by sliding a filter over the area, which is the same as multiplying the input data by a set of weights.

The pooling layer picks the highest value within an area. These layers act to extract the important features from the input for classification.
Pooling also scales the input down, which reduces the number of parameters even more while conserving enough information. This is very useful in image classification, where the last layers of the network need to determine whether an object is present in the scene, but not where.
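As an illustration of what max pooling does to a feature map, here is a minimal plain-Python sketch (no padding, hypothetical function name and toy data):

```python
def max_pool_2d(image, pool_size=2, stride=2):
    """Max pooling over a 2-D list of numbers, without padding."""
    rows = (len(image) - pool_size) // stride + 1
    cols = (len(image[0]) - pool_size) // stride + 1
    return [
        [
            max(
                image[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[6, 5], [7, 9]]
```

The 4x4 input shrinks to 2x2, keeping only the strongest response in each window.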
An MLP usually flattens the image into a simple vector of pixels; in other words, there is no indication of where each pixel was located. A CNN does consider the location, on the assumption that nearby locations are more strongly correlated.

In other words, convolutional layers take advantage of the local spatial coherence of the input. This is only possible because we assume that spatially close inputs are correlated. For images, this can be seen by the fact that the image loses its meaning when the pixels are shuffled.

Mathematically, a convolutional layer defines a kernel (weight matrix) which is multiplied element by element with a section of the input of the same size. Sum all the resulting values. Move the kernel a number of pixels equal to a defined stride size and repeat until you go through the whole image. Afterwards apply an activation function to each one of the values. Now repeat all of this for each filter in your layer.
Because the same kernel is reused at every position, CNNs are able to cut down on the number of parameters by sharing weights. This makes them extremely efficient in image processing, compared to multi-layer perceptrons.
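The multiply, sum, slide and activate steps can be written out for a single filter. This is a plain-Python sketch with a made-up image and kernel, no padding and no bias term; a real layer would repeat it for every filter.

```python
def conv2d_single_filter(image, kernel, stride=1):
    """Slide one square kernel over a 2-D image and apply ReLU to each sum."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Element-by-element product of the kernel and the image patch, summed.
            total = sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(max(0, total))  # ReLU activation
        output.append(row)
    return output


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 1, 0, 1],
    [1, 0, 2, 3],
]
kernel = [[1, 0], [0, -1]]
print(conv2d_single_filter(image, kernel, stride=2))  # [[0, 0], [2, 0]]
```

The kernel has only 4 weights no matter how large the image is, which is the weight sharing mentioned above.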
Example of constructing Convolutional Layers in Keras

To create a convolutional layer in Keras, you must first import the necessary module:
from keras.layers import Conv2D

Then, you can create a convolutional layer by using the following format (relu stands for rectified linear unit, relu(x) = max(0, x)):
Conv2D(filters, kernel_size, strides, padding, activation='relu', input_shape)

You must pass the following arguments:
filters - The number of filters.
kernel_size - Number specifying both the height and width of the (square) convolution window.

There are some additional, optional arguments that you might like to tune:
strides - The stride of the convolution. If you don't specify anything, strides is set to 1.
padding - One of 'valid' or 'same'. If you don't specify anything, padding is set to 'valid'.
activation - Typically 'relu'. If you don't specify anything, no activation is applied. You are strongly encouraged to add a ReLU activation function to every convolutional layer in your networks.

NOTE: It is possible to represent both kernel_size and strides as either a number or a tuple.

When using your convolutional layer as the first layer (appearing after the input layer) in a model, you must provide an additional input_shape argument:
input_shape - Tuple specifying the height, width, and depth (in that order) of the input.

NOTE: Do not include the input_shape argument if the convolutional layer is not the first layer in your network.

Example #1: Say I'm constructing a CNN, and my input layer accepts grayscale images that are 200 by 200 pixels (corresponding to a 3D array with height 200, width 200, and depth 1). Then, say I'd like the next layer to be a convolutional layer with 16 filters, each with a width and height of 2. When performing the convolution, I'd like the filter to jump two pixels at a time. I also don't want the filter to extend outside of the image boundaries; in other words, I don't want to pad the image with zeros. Then, to construct this convolutional layer, I would use the following line of code:

Conv2D(filters=16, kernel_size=2, strides=2, activation='relu', input_shape=(200, 200, 1))

Example #2: Say I'd like the next layer in my CNN to be a convolutional layer that takes the layer constructed in Example 1 as input. Say I'd like my new layer to have 32 filters, each with a height and width of 3. When performing the convolution, I'd like the filter to jump 1 pixel at a time. I want the convolutional layer to see all regions of the previous layer, and so I don't mind if the filter hangs over the edge of the previous layer when it's performing the convolution.

Then, to construct this convolutional layer, I would use the following line of code:
Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')

Example #3: If you look up code online, it is also common to see convolutional layers in Keras in this format:
Conv2D(64, (2,2), activation='relu')

In this case, there are 64 filters, each with a size of 2x2, and the layer has a ReLU activation function. The other arguments in the layer use the default values, so the convolution uses a stride of 1, and the padding has been set to 'valid'.  
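To check what these layers produce without running Keras, the output shape and the number of trainable parameters can be worked out by hand. The helpers below are my own sketch of the usual formulas for square kernels and equal strides in both directions; they are not part of the Keras API.

```python
import math


def conv2d_output_shape(input_shape, filters, kernel_size, strides=1, padding="valid"):
    """Output (height, width, depth) of a convolutional layer."""
    height, width, _ = input_shape
    if padding == "same":
        out_h = math.ceil(height / strides)
        out_w = math.ceil(width / strides)
    else:  # 'valid': the filter never leaves the image
        out_h = (height - kernel_size) // strides + 1
        out_w = (width - kernel_size) // strides + 1
    return (out_h, out_w, filters)


def conv2d_param_count(input_depth, filters, kernel_size):
    """Weights per filter (kernel_size * kernel_size * input_depth) plus one bias each."""
    return filters * (kernel_size * kernel_size * input_depth + 1)


# Example #1: 200x200 grayscale input, 16 filters of size 2, stride 2, no padding.
shape_1 = conv2d_output_shape((200, 200, 1), filters=16, kernel_size=2, strides=2)
print(shape_1, conv2d_param_count(1, 16, 2))  # (100, 100, 16) 80

# Example #2: takes Example #1 as input, 32 filters of size 3, stride 1, 'same' padding.
shape_2 = conv2d_output_shape(shape_1, filters=32, kernel_size=3, padding="same")
print(shape_2, conv2d_param_count(16, 32, 3))  # (100, 100, 32) 4640
```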

Example of Max Pooling Layers in Keras
To create a max pooling layer in Keras, you must first import the necessary module:
from keras.layers import MaxPooling2D

Then, you can create a max pooling layer by using the following format:
MaxPooling2D(pool_size, strides, padding)

Arguments: You must include the following argument:
pool_size - Number specifying the height and width of the pooling window.

There are some additional, optional arguments that you might like to tune:
strides - The vertical and horizontal stride. If you don't specify anything, strides will default to pool_size.
padding - One of 'valid' or 'same'. If you don't specify anything, padding is set to 'valid'.

Example: Say I'm constructing a CNN, and I'd like to reduce the dimensionality of a convolutional layer by following it with a max pooling layer. Say the convolutional layer has size (100, 100, 15), and I'd like the max pooling layer to have size (50, 50, 15). I can do this by using a 2x2 window in my max pooling layer, with a stride of 2, which could be constructed in the following line of code:

MaxPooling2D(pool_size=2, strides=2)

If you'd instead like to use a stride of 1, but still keep the size of the window at 2x2, then you'd use:

MaxPooling2D(pool_size=2, strides=1)

You can check the dimensionality of a max pooling layer with the following:

from keras.models import Sequential
from keras.layers import MaxPooling2D
model = Sequential()
model.add(MaxPooling2D(pool_size=2, strides=2, input_shape=(100, 100, 15)))
model.summary()
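The same check can be done by hand. The helper below is a hypothetical sketch of the output-size formula for 'valid' padding, with strides defaulting to pool_size as described above.

```python
def max_pool_output_shape(input_shape, pool_size, strides=None):
    """Output (height, width, depth) of a max pooling layer with 'valid' padding."""
    if strides is None:
        strides = pool_size  # same default as described for MaxPooling2D
    height, width, depth = input_shape
    return (
        (height - pool_size) // strides + 1,
        (width - pool_size) // strides + 1,
        depth,  # pooling never changes the depth
    )


print(max_pool_output_shape((100, 100, 15), pool_size=2, strides=2))  # (50, 50, 15)
print(max_pool_output_shape((100, 100, 15), pool_size=2, strides=1))  # (99, 99, 15)
```

So a stride of 2 halves the height and width, while a stride of 1 with the same 2x2 window shrinks each of them by only one pixel.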

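Example of CNN Architecture
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu',
                 input_shape=(32, 32, 3)))
# Pooling after each convolution is assumed here (MaxPooling2D is imported above)
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
# Flatten the 3D feature maps into a vector before the fully connected layers
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()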
What's Epoch in deep learning?
In deep learning, the number of epochs is a hyperparameter which is defined before training a model. One epoch is when the entire dataset is passed both forward and backward through the neural network exactly once.

The entire dataset is usually too big to feed to the computer at once, so we divide it into several smaller batches. We use more than one epoch because gradient descent is an iterative process: with a limited dataset, updating the weights with a single pass is not enough, so we pass the full dataset through the same neural network multiple times.

The batch size is the total number of training examples present in a single batch, and the number of iterations is the number of batches needed to complete one epoch. For example, if we divide a dataset of 2000 training examples into batches of 500, then 4 iterations will complete 1 epoch.

1 Epoch = 1 forward pass + 1 backward pass for ALL training samples.
Batch size = number of training samples in 1 forward/backward pass. (As the batch size increases, the required memory increases.)
Number of iterations = number of passes, where 1 pass = 1 forward pass + 1 backward pass (the forward pass and backward pass are not counted separately).
Example: if we have 1000 training samples and the batch size is set to 500, it will take 2 iterations to complete 1 epoch.
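The arithmetic is simple enough to sketch. The helper names are hypothetical; the last batch is allowed to be smaller when the dataset does not divide evenly.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def iterate_batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


print(iterations_per_epoch(2000, 500))  # 4
print(iterations_per_epoch(1000, 500))  # 2
print([len(batch) for batch in iterate_batches(list(range(10)), 4)])  # [4, 4, 2]
```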
Training set: a set of examples used for learning, i.e. to fit the parameters of the classifier. In the MLP case, we would use the training set to find the "optimal" weights with the back-prop rule.
Validation set: a set of examples used to tune the hyperparameters of a classifier. In the MLP case, we would use the validation set to find the "optimal" number of hidden units or determine a stopping point for the back-propagation algorithm.
Test set: a set of examples used only to assess the performance of a fully-trained classifier. In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights).
Why separate test and validation sets? The error rate estimate of the final model on validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model. After assessing the final model on the test set, YOU MUST NOT tune the model any further!
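A three-way split can be sketched with the standard library alone. The 60/20/20 proportions, the seed and the function name are assumptions for illustration.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the data into train, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The three parts never overlap, so the test set stays untouched while the validation set is used for tuning.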

Question: How do we use GridSearchCV to tune the best parameters of a Keras Sequential model?

If you get an error message like: "Early stopping conditioned on metric "val_loss" which is not available. Available metrics are: loss,acc", it's most likely that you either didn't specify a validation dataset or the validation dataset is too small.
Here is some example code:
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[EarlyStopping(monitor='val_loss', patience=100), ModelCheckpoint(filepath=file_model, monitor='val_loss', save_best_only=True)])

If the grid search process hangs with no progress, then you might want to set n_jobs=1 instead of n_jobs=-1.
# Use scikit-learn to grid search the batch size and epochs
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
# Function to create model, required for KerasClassifier
def create_model():
 # create model
 model = Sequential()
 model.add(Dense(12, input_dim=8, activation='relu'))
 model.add(Dense(1, activation='sigmoid'))
 # Compile model
 model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
 return model
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(build_fn=create_model, verbose=1)
# define the grid search parameters
batch_size = [10, 20]
epochs = [10, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
#===============================================================

Transfer learning example: All of the pre-built models come with weights and biases pre-trained on the ImageNet 2012 dataset, which is a dataset of 1.2 million images across 1000 classes. If your need is simply to predict whether an image belongs to one of the 1000 classes of the ImageNet dataset, then you can use the pre-trained pre-built models as-is.
import cv2
from keras.applications import ResNet50
from keras.applications.resnet50 import preprocess_input, decode_predictions
# Get a pre-built ResNet50 model
model = ResNet50(weights='imagenet')
# Read the image into memory as a numpy array (OpenCV loads it in BGR channel order)
image = cv2.imread('elephant.jpg', cv2.IMREAD_COLOR)
# Convert to RGB, which is what preprocess_input expects
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Resize the image to fit the input shape of ResNet model
image = cv2.resize(image, (224, 224), interpolation=cv2.INTER_LINEAR)
# Preprocess the image using the same image processing used by the pre-built model
image = preprocess_input(image)
# Reshape from (224, 224, 3) to (1, 224, 224, 3) for the predict() method
image = image.reshape((-1, 224, 224, 3))
# Call the predict() method to classify the image
predictions = model.predict(image)
# Display the class name based on the predicted label
# using the decode function for the built-in model.
print(decode_predictions(predictions, top=3))
To reuse the pre-trained model for your own classes instead, drop its classifier and train a new one on top:
from keras.applications import ResNet50
from keras.layers import Dense
from keras import Model
# Get a pre-trained/pre-built model without the classifier, and retain the
# global average pooling layer following the final convolution (bottleneck) layer
model = ResNet50(include_top=False, pooling='avg', weights='imagenet')
# Freeze the weights of the remaining layers
for layer in model.layers:
    layer.trainable = False
# Add a classifier for 20 classes
output = Dense(20, activation='softmax')(model.output)
# Build the new model and compile it for training
model = Model(model.input, output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Now train the model
#===============================================================
Question: What's Isolation Forest? 

The main idea, which is different from other popular outlier detection methods, is that Isolation Forest explicitly identifies anomalies instead of profiling normal data points.

Isolation Forest, like any tree ensemble method, is built on the basis of decision trees.
In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum value of the selected feature.

In principle, outliers are less frequent than regular observations and are different from them in terms of values (they lie further away from the regular observations in the feature space). That is why, with such random partitioning, they should be isolated closer to the root of the tree (shorter average path length, i.e., the number of edges an observation must pass in the tree going from the root to the terminal node), with fewer splits necessary.
The idea of identifying a normal vs. abnormal observation can be observed in Figure 1 from [1]. A normal point (on the left) requires more partitions to be identified than an abnormal point (right).
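The path-length idea can be sketched on one-dimensional data, where "randomly selecting a feature" is trivial because there is only one. This is a simplified illustration with made-up data and hypothetical names, not the scikit-learn IsolationForest implementation.

```python
import random


def isolation_path_length(values, target, rng, depth=0, max_depth=50):
    """Number of random splits needed until `target` is alone in its partition."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    low, high = min(values), max(values)
    if low == high:
        return depth  # identical values cannot be separated any further
    split = rng.uniform(low, high)  # random split between the min and max value
    # Keep only the points that fall on the same side of the split as the target.
    same_side = [v for v in values if (v < split) == (target < split)]
    return isolation_path_length(same_side, target, rng, depth + 1, max_depth)


def average_path_length(values, target, trees=200, seed=0):
    """Average the path length over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_path_length(values, target, rng) for _ in range(trees)) / trees


data = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 1.15, 10.0]
outlier_path = average_path_length(data, 10.0)
normal_path = average_path_length(data, 1.0)
print(outlier_path, normal_path)  # the outlier is isolated in far fewer splits
```

The point at 10.0 is usually cut off by the very first split, while a point inside the cluster needs several, which is what the anomaly score is built on.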
