Monday, November 23, 2020

Python Study notes: example of using H2O/AutoML

Here are the instructions for installing H2O in Python:
Use H2O directly from Python
1. Prerequisite: Python 2.7.x, 3.5.x, or 3.6.x
2. Install dependencies (prepending with `sudo` if needed):

pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
Conda Installation: Available at https://anaconda.org/h2oai/h2o/
To install this package with conda run:
conda install -c h2oai h2o
At the command line, copy and paste these commands one line at a time:
# The following command removes the H2O module for Python.

pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-zermelo/2/Python/h2o-3.32.0.2-py2.py3-none-any.whl
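To check that the module installed correctly, you can start Python and print the package version (a minimal sanity check; the exact version string depends on the wheel you installed):

import h2o
print(h2o.__version__)   # e.g. 3.32.0.2 if the wheel above was installed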
Some examples of code using Python/H2O for machine learning models, originally from here:

import pandas as pd
import numpy as np
import h2o
pd.set_option('display.width', 5000)

# The first thing to do is start H2O.
# You can call h2o.init() to initialize the H2O cluster.
h2o.init()
#You can see that the output from this method consists of
#some meta-information about your H2O cluster.
bank_df = h2o.upload_file("user//bank-additional/bank-additional-full.csv")

# Looking at the type of this variable,
# we can see that it is h2o.frame.H2OFrame.
# This is not a pandas object, but H2O's own frame object.
type(bank_df)

bank_df.shape
bank_df.names
bank_df.describe()

# Convert a pandas DataFrame to an H2OFrame
import pandas as pd
df = pd.DataFrame({'col1': [1,1,2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)

# Optionally specify column types and which strings should be treated as missing values
col_dtypes = {'col1': 'numeric', 'col2': 'string'}
na_list = ['NA', 'none', 'nan']
hf = h2o.H2OFrame(df, column_types=col_dtypes, na_strings=na_list)
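
# The reverse conversion is also available: as_data_frame() turns an H2OFrame back into a
# pandas DataFrame (a minimal sketch using the hf frame created above).
df_back = hf.as_data_frame()
print(type(df_back))
print(df_back.head())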

# show 6th row
print(bank_df[5,:])
# show 6-7 rows
print(bank_df[5:7,:])
# show first 4 columns from 6-7 rows
print(bank_df[5:7,0:4])
# show job,education and y columns from 6-7 rows
print(bank_df[5:7, ['job', 'education', 'y']])

x = bank_df.names
x.remove("y")
print(x)
y = "y" 
# The first model
# Now let's train a model.
# First, we need to split our dataset into training and testing parts.
# H2O lets us do this with the split_frame() method.
train, test = bank_df.split_frame([0.7], seed=42)

from h2o.estimators import H2ORandomForestEstimator

rf = H2ORandomForestEstimator(ntrees=200)
rf.train(x=x,
         y=y,
         training_frame=train,
         validation_frame=test)

print(rf)
# A lot of interesting and useful information is available here. Notice two blocks of output:
# the first is reported on the training set and the second on the validation (test) set.
# Different performance metrics are shown (MSE, RMSE, LogLoss, AUC, Gini, etc.).
# The confusion matrix is especially useful for error analysis; H2O reports it for both the
# training and the validation set (see the snippet after the output below for how to pull
# these metrics programmatically).

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.057228614446939205
RMSE: 0.2392250288889923
LogLoss: 0.1832959078306571
Mean Per-Class Error: 0.11324444149577517
AUC: 0.9421770006956457
AUCPR: 0.6371759180029333
Gini: 0.8843540013912914

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3274014892227185: 
               no     yes   Error               Rate
0     no  23749.0  1963.0  0.0763   (1963.0/25712.0)
1    yes    742.0  2481.0  0.2302     (742.0/3223.0)
2  Total  24491.0  4444.0  0.0935   (2705.0/28935.0)
...

ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.05856774862159308
RMSE: 0.24200774496200134
LogLoss: 0.1840719589577895
Mean Per-Class Error: 0.11586154049350128
AUC: 0.9420647034259153
AUCPR: 0.6551751738303531
Gini: 0.8841294068518306

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2643425350574156: 
               no     yes   Error               Rate
0     no   9818.0  1018.0  0.0939   (1018.0/10836.0)
1    yes    254.0  1163.0  0.1793     (254.0/1417.0)
2  Total  10072.0  2181.0  0.1038   (1272.0/12253.0)
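
# Instead of reading these numbers off the printed report, you can also query them from the
# model object. A small sketch (the values should match the report above):
print(rf.auc(valid=True))               # AUC on the validation frame
print(rf.logloss(valid=True))           # LogLoss on the validation frame
print(rf.confusion_matrix(valid=True))  # confusion matrix on the validation frame
# Or compute a full set of metrics on any frame explicitly
perf = rf.model_performance(test)
print(perf.auc())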

predictions=rf.predict(test)
(predictions["predict"] == test["y"]).mean()
##************************************************
# AutoML
# H2O provides the ability to perform automated machine learning.
# The process is very simple and is oriented toward users without much knowledge
# and experience in machine learning.
# AutoML iterates through different models and parameters, trying to find the best one.
# There are several parameters you can specify, but in most cases all you need to do
# is set the maximum runtime in seconds or the maximum number of models.
# You can think of AutoML as something similar to grid search,
# but at the level of models rather than at the level of parameters.

from h2o.automl import H2OAutoML
autoML = H2OAutoML(max_runtime_secs=120)
autoML.train(x=x,
             y=y,
             training_frame=bank_df)

# We can look at a table of all the models that were tried and their corresponding
# performance by checking the .leaderboard attribute of the autoML instance.
# A GBM with a 0.94 AUC seems to be the best model here.

leaderboard = autoML.leaderboard
print(leaderboard)

#to show the best model:
autoML.leader

predictionAML = autoML.predict(test)
(predictionAML["predict"] == test["y"]).mean()
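
# You can also evaluate the leader model directly. A brief sketch, assuming the same
# train/test split as above (note that AutoML was trained on the full bank_df here,
# so this only illustrates the API, not an unbiased estimate):
leader_perf = autoML.leader.model_performance(test)
print(leader_perf.auc())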
##*************************************************
## Another example using H2O
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)
aml

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)
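
# As in the first example, you can generate predictions with the best AutoML model.
# A short sketch on the Higgs test frame:
preds = aml.leader.predict(test)
print((preds["predict"] == test[y]).mean())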
##************************************************
# The first algorithm we want to train is a neural network.
# To use this model we need to import H2ODeepLearningEstimator from the
# h2o.estimators.deeplearning module, then create an instance of this estimator.
# Like in the previous example with Random Forest, you can pass many different
# parameters to control the model and the training process.
# It is important to set up the architecture of the network: the hidden parameter
# takes a list with the number of neurons in each hidden layer, so it controls both
# the number of hidden layers and the number of neurons in those layers.
# We set up 3 hidden layers with 100, 10 and 4 neurons respectively,
# and use Tanh as the activation function.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
dl = H2ODeepLearningEstimator(hidden=[100, 10, 4],activation='Tanh')
dl.train(x=x, y=y, training_frame=train, validation_frame=test)
predictions_dl = dl.predict(test)
print((predictions_dl["predict"] == test[y]).mean())
# We can see that the accuracy is slightly lower than with Random Forest.
# Maybe we can fine-tune the model's parameters to get better performance.
# In the next few cells we train a linear model. family="binomial" means we want to
# perform classification with logistic regression; lambda_search enables a search
# for the optimal regularization parameter lambda.

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
lm = H2OGeneralizedLinearEstimator(family="binomial",
                                   lambda_search=True)
lm.train(x=x,
         y=y,
         training_frame=train,
         validation_frame=test)

predictions_lm = lm.predict(test)
print((predictions_lm["predict"] == test[y]).mean())
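
# Since this is a regularized logistic regression, it can be useful to inspect the fitted
# coefficients. A small sketch using the coefficient accessors of the GLM estimator:
coefs = lm.coef()               # coefficients on the original scale
print(list(coefs.items())[:10]) # first few coefficients
print(lm.coef_norm())           # standardized coefficients, easier to compare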

# The last model we want to use here is the Gradient Boosting algorithm.
# With the default parameters it can give the best results among all these algorithms.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gb = H2OGradientBoostingEstimator()
gb.train(x=x,
         y=y,
         training_frame=train,
         validation_frame=test)

predictions_gb = gb.predict(test)
print((predictions_gb["predict"] == test[y]).mean())
# It is worth mentioning the XGBoost integration in the H2O platform.
# XGBoost is one of the most powerful implementations of the gradient boosting idea.
# You can install it standalone, but it is also very convenient to use XGBoost from H2O.
# In the cell below you can see how to create an instance of H2OXGBoostEstimator and how
# to train it. Keep in mind that XGBoost has many parameters and can be quite sensitive
# to changes in them.

from h2o.estimators.xgboost import H2OXGBoostEstimator

param = {
         "ntrees" : 400,
         "max_depth" : 4,
         "learn_rate" : 0.01,
         "sample_rate" : 0.4,
         "col_sample_rate_per_tree" : 0.8,
         "min_rows" : 5,
         "seed": 4241,
         "score_tree_interval": 100
         }
# Pass the parameters to the estimator and train it before predicting
xgb = H2OXGBoostEstimator(**param)
xgb.train(x=x, y=y, training_frame=train, validation_frame=test)

predictions_xgb = xgb.predict(test)
print((predictions_xgb["predict"] == test[y]).mean())
##************************************************
# Cross validation in H2O
# Cross validation is one of the core techniques used in machine learning.
# The basic idea is to split the dataset into several parts (folds) and train the model on
# all but one fold, which is held out for testing. Then the next iteration begins: the
# previous testing fold joins the training data and a different fold is held out.
# For example, if we split the dataset into 3 folds, on the first iteration we use the 1st
# and 2nd folds for training and the 3rd for testing; on the second iteration the 1st and
# 3rd folds are used for training and the 2nd for testing; on the third iteration the 2nd
# and 3rd folds are used for training and the 1st for testing.
# Cross validation lets us estimate the model's performance in a more accurate and reliable way.
# In H2O it is simple to do cross validation: if the model supports it, pass the optional
# nfolds parameter when creating the estimator to specify the number of folds.
# H2O builds nfolds + 1 models; the additional model is trained on all the available data
# and is the main model you get as the result of training.
# Let's train a Random Forest and perform cross validation with 3 folds.
# Note that we do not pass a validation (test) set, but the entire dataset.

rf_cv = H2ORandomForestEstimator(ntrees=200, nfolds=3)
# Train on the full bank dataset; when x is omitted, all columns except the response are used as predictors
rf_cv.train(y="y", training_frame=bank_df)
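
# After training with nfolds, the per-fold results are available on the model.
# A short sketch:
print(rf_cv.cross_validation_metrics_summary())
print(rf_cv.auc(xval=True))   # AUC aggregated over the cross-validation folds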
##************************************************
# Model tuning using grid search
# Often you need to try many different parameters and their combinations to find the one
# that produces the best model performance. Doing this by hand is tedious, and grid search
# automates the process: you specify the set of hyperparameters you want to try, and the
# system trains and evaluates a model for every possible combination of the parameters.
# First, import the grid search class:

from h2o.grid.grid_search import H2OGridSearch

# Now specify all the parameter values you want to try. We are going to search for an
# optimal combination of parameters for the XGBoost model we built earlier.
# The parameters are placed in a Python dictionary where the keys are the parameter names
# and the values are lists of possible values.

xgb_parameters = {'max_depth': [3, 6],
                  'sample_rate': [0.4, 0.7],
                  'col_sample_rate': [0.8, 1.0],
                  'ntrees': [200, 300]}

# The next step is to create the GridSearch instance. You should pass the model class,
# an id for the grid, and the dictionary of hyperparameters.

xgb_grid_search = H2OGridSearch(model=H2OXGBoostEstimator,
                                grid_id='example_grid',
                                hyper_params=xgb_parameters)

# Eventually, you can run the grid search. Note that we set a higher learning rate, because
# grid search is very time-consuming: the number of models to train grows rapidly with the
# number of hyperparameters, and this is only a learning example.

xgb_grid_search.train(x=x,
                      y=y,
                      training_frame=train,
                      validation_frame=test,
                      learn_rate=0.3,
                      seed=42)

# We can get the results of the grid search with the get_grid() method, sorted here by the
# accuracy metric in descending order.

grid_results = xgb_grid_search.get_grid(sort_by='accuracy', decreasing=True)
print(grid_results)

# The highest accuracy is obtained with a column sample rate of 1.0, a sample rate of 0.4,
# 200 trees, and a maximum tree depth of 3.
