Saturday, February 13, 2021

Data Science Study Notes: reinforcement learning

Terminology: State vs. Action vs. Policy vs. Reward vs. State Transition. The policy function is a probability density function (PDF); a policy network uses a neural network to approximate the policy function.
Policy-based reinforcement learning:
Policy-based reinforcement learning algorithm:
Policy-based method: if a good policy function π were known, the agent could be controlled by the policy: randomly sample a_t from π. In reality, however, we don't know π; in fact, it is exactly what we are trying to learn. So we approximate the policy function π with a policy network, which is where the deep learning layers (convolutional/dense) come into play. When searching for the best policy, use the policy gradient algorithm to maximize the expectation of the reward.
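As a concrete (hypothetical) illustration, here is a minimal numpy sketch of a policy network: a small dense net maps a state to a probability distribution over actions, and a_t is sampled from that distribution. The layer sizes and the random weights are made up for illustration, not taken from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4-dimensional state and 2 discrete actions.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2

# Randomly initialized weights stand in for a trained policy network.
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def policy(state):
    """Map a state to a probability distribution over actions: pi(a|s)."""
    h = np.tanh(state @ W1)              # dense hidden layer
    logits = h @ W2
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def act(state):
    """Control the agent by randomly sampling a_t from pi(.|s_t)."""
    return rng.choice(N_ACTIONS, p=policy(state))

state = rng.normal(size=STATE_DIM)
probs = policy(state)
print(probs, act(state))
```

The softmax output is what makes the policy a proper probability distribution to sample from.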

The relationship:
Actor = athlete (the player). Critic = referee. How do we train the player to become a champion? The athlete needs a lot of practice and training, but how does the player know he is getting better? Through the referee, who provides the immediate feedback.

However, in the beginning the referee himself may not know exactly which actions are best, so the referee also needs to be trained to recognize which actions lead to high performance and which do not. After we have trained the value network (critic) for the referee, we then use that network to train the player.
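The actor/critic interplay described above can be sketched in a toy tabular form (a made-up example, not the tutorial's neural-network version): the critic's TD error plays the referee's feedback, and both the critic and the actor are updated from it.

```python
import numpy as np

# Toy setup: 2 states, 2 actions; tabular "networks" so the updates are visible.
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))  # actor: preferences -> softmax policy
V = np.zeros(n_states)                   # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.9

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update(s, a, r, s_next):
    """One actor-critic step: the critic scores the move, the actor follows."""
    td_error = r + gamma * V[s_next] - V[s]   # referee's judgment of the move
    V[s] += alpha_critic * td_error            # train the critic (value network)
    grad = -softmax(theta[s])                  # gradient of log pi for softmax
    grad[a] += 1.0
    theta[s] += alpha_actor * td_error * grad  # train the actor (policy network)

# Action 0 in state 0 always pays +1; the actor should learn to prefer it.
for _ in range(200):
    update(s=0, a=0, r=1.0, s_next=1)
print(softmax(theta[0]))  # probability of action 0 now exceeds 0.5
```

The sketch shows why the critic must be reasonable before it can usefully coach the actor: the actor only moves in the direction the TD error points.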
How do we train the athlete (the policy network = actor):
How do we train the referee/critic together:
Train two networks, the policy network (player) and the value network (critic); what happens during training and after:
Train the network -1:
Train the network -2:
Train the network -3:
Value-based reinforcement learning:
Gym is a toolkit for developing and comparing reinforcement learning algorithms. Classic control problems: CartPole, Pendulum; MuJoCo (continuous control tasks, e.g. Humanoid walking continuously).

This tutorial is based on the lectures of reinforcement learning grandmaster Dr. Wang.

Monday, February 1, 2021

GCP Study notes: Biquery API SQL examples

In case you are using GCP BigQuery to prepare a dataset, here are some examples:

--check the column data types without clicking through the table UI
--(query the dataset's INFORMATION_SCHEMA; the dataset name here is illustrative):
SELECT column_name, data_type
FROM `project_id1.Dataschema1.INFORMATION_SCHEMA.COLUMNS`
WHERE  table_name="table_name_c";

#use the BigQuery select statement via UNNEST for a repeated RECORD column;
#first look at the raw repeated field:
SELECT event_name, event_params
FROM `firebase-public-project.analytics_153293282.events_20180915`
WHERE event_name = "level_complete_quickplay"

#You might get the error "cannot access field key on a value with type ..." if you reference param.key without UNNEST; flattening the repeated field first fixes it:
SELECT event_name, param
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"

SELECT event_name, param.value.int_value AS score
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"
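For intuition, the same flatten-then-filter pattern can be sketched in pandas, where explode() plays the role of UNNEST. The event rows below are made-up stand-ins for the Firebase data:

```python
import pandas as pd

# Hypothetical events table: each row carries a repeated RECORD-like column.
events = pd.DataFrame({
    "event_name": ["level_complete_quickplay", "level_start_quickplay"],
    "event_params": [
        [{"key": "value", "int_value": 86}, {"key": "board", "int_value": 2}],
        [{"key": "level", "int_value": 5}],
    ],
})

# explode() flattens the repeated column into one row per element, like UNNEST.
flat = events.explode("event_params").reset_index(drop=True)
flat["key"] = flat["event_params"].apply(lambda p: p["key"])
flat["score"] = flat["event_params"].apply(lambda p: p["int_value"])

result = flat[(flat["event_name"] == "level_complete_quickplay")
              & (flat["key"] == "value")]
print(result[["event_name", "score"]])
```

Each exploded row corresponds to one `param` row produced by `UNNEST(event_params)`.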

drop table if exists `project_id1.Dataschema1.table_name1`;
create table `project_id1.Dataschema1.table_name1`
as select distinct var1,var2
from `project_id1.Dataschema1.table_name_a` as a 
inner join `project_id1.Dataschema1.table_name_b`  as trans
on  cast(LPAD(cast(a.cntycd as string), 5, '0') as string)=trans.cntycd ;
--pad the string with leading 0s.
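As a quick sanity check of what the LPAD call does (left-pad a code to 5 characters with zeros), here is the Python equivalent:

```python
# LPAD(cast(cntycd as string), 5, '0') in BigQuery ~ str.zfill(5) in Python
cntycd = 123
padded = str(cntycd).zfill(5)
print(padded)  # '00123'

# str.rjust gives the same result for an arbitrary fill character
assert padded == str(cntycd).rjust(5, "0")
```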

Sunday, January 3, 2021

Python Study notes: How HDBSCAN works?

HDBSCAN is a clustering algorithm. It extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) by converting it into a hierarchical clustering algorithm. The main concept of the DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density. Here are some good tutorials. The steps of the DBSCAN algorithm:
The algorithm starts with an arbitrary point which has not been visited; its neighborhood information is retrieved using the ϵ parameter.

If this point has at least MinPts points within its ϵ-neighborhood, cluster formation starts. Otherwise the point is labeled as noise. This point can later be found within the ϵ-neighborhood of a different point and thus be made part of a cluster. The concepts of density-reachable and density-connected points are important here.

If a point is found to be a core point, then the points within its ϵ-neighborhood are also part of the cluster. So all points found within the ϵ-neighborhood are added, along with their own ϵ-neighborhoods if they are also core points.

The above process continues until the density-connected cluster is completely found.

The process restarts with a new point which can be a part of a new cluster or labeled as noise.
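The steps above can be tried directly with scikit-learn's DBSCAN implementation; the eps and min_samples values below are illustrative choices for two-moons data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two half-moon clusters plus a far-away point that should come out as noise.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X = np.vstack([X, [[5.0, 5.0]]])  # an obvious outlier

# eps is the neighborhood radius, min_samples the MinPts threshold above
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(sorted(set(labels)))  # noise points are labeled -1
```

Points whose neighborhoods never reach min_samples, like the lone outlier here, end up with the -1 noise label.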

The notes are taken from the notebook.
#import package and setup the display
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data
%matplotlib inline
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

#load the data:
moons, _ = data.make_moons(n_samples=50, noise=0.05)
blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)
test_data = np.vstack([moons, blobs])
plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)
Time to import the hdbscan package and run the hierarchical clustering algorithm.
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
clusterer.fit(test_data)
So now that we have clustered the data -- what actually happened? We can break it out into a series of steps:

Transform the space according to the density/sparsity.
Build the minimum spanning tree of the distance weighted graph.
Construct a cluster hierarchy of connected components.
Condense the cluster hierarchy based on minimum cluster size.
Extract the stable clusters from the condensed tree.

Core distance for a point x is defined for a parameter k and denoted core_k(x), via the k-th nearest neighbor. In other words, draw the smallest circle around x that covers its k nearest data points; the radius of that circle is the core distance.

Mutual reachability distance: for two points a and b, take the largest of their two core-distance circles and the plain distance between them: d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)}.
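A minimal numpy sketch of these two definitions (the helper names and the tiny point set are made up for illustration):

```python
import numpy as np

def core_distance(X, i, k):
    """Radius of the smallest circle around X[i] covering its k nearest neighbors."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.sort(d)[k]  # index k skips the zero distance to the point itself

def mutual_reachability(X, i, j, k):
    """d_mreach-k(a, b) = max(core_k(a), core_k(b), d(a, b))."""
    d_ij = np.linalg.norm(X[i] - X[j])
    return max(core_distance(X, i, k), core_distance(X, j, k), d_ij)

# Tiny hand-checkable example on a line: points at 0, 1, 2, 10.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
m01 = mutual_reachability(X, 0, 1, k=2)
print(m01)  # 2.0: the core distance of point 0 dominates
```

Note how the larger core distance "inflates" the plain distance of 1.0 between the first two points, which is exactly how HDBSCAN pushes sparse points away from dense regions.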


#Any point not in a selected cluster is simply a noise point(assigned the label -1)
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat) 
                  if col >= 0 else (0.5, 0.5, 0.5) for col, sat in 
                  zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)                                      
Parameter Selection for HDBSCAN

1. Selecting min_cluster_size: the smallest size grouping that you wish to consider a cluster.

from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits()
data = digits.data
projection = TSNE().fit_transform(data)
plt.scatter(*projection.T, **plot_kwds)

#start with a min_cluster_size of 15
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

#Increasing the min_cluster_size to 30 
#reduces the number of clusters, merging some together.
2. Selecting min_samples. The larger the value of min_samples you provide, the more conservative the clustering – more points will be declared as noise, and clusters will be restricted to progressively more dense areas.
clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=1).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

Monday, November 23, 2020

Python Study notes: example of using H2O/AutoML

Here are the instructions to install H2O in Python:
Use H2O directly from Python
1. Prerequisite: Python 2.7.x, 3.5.x, or 3.6.x
2. Install dependencies (prepending with `sudo` if needed):

pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
Conda Installation: Available at
To install this package with conda run:
conda install -c h2oai h2o
At the command line, copy and paste these commands one line at a time:
# The following command removes the H2O module for Python.

pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
pip install
Some example code using Python/H2O for machine learning models, originally from here:

import pandas as pd
import numpy as np
import h2o
pd.set_option('display.width', 5000)

#The first thing you should do is start H2O. 
#You can run the method h2o.init() to initialize H2O.
#The output from this method consists of
#some meta-information about your H2O cluster.
h2o.init()
bank_df = h2o.upload_file("user//bank-additional/bank-additional-full.csv")

#Looking at the type of this variable, 
#we can see that the type is h2o.frame.H2OFrame. 
#This is not a pandas object, but H2O's own object.


#convert python dataframe to h2O frame
import pandas as pd
df = pd.DataFrame({'col1': [1,1,2], 'col2': ['César Chávez Day', 'César Chávez Day', 'César Chávez Day']})
hf = h2o.H2OFrame(df)

#specify names
col_dtypes = {'col1_name':col1_type, 'col2_name':col2_type}
na_list = ['NA', 'none', 'nan', 'etc']
hf = h2o.H2OFrame(df, column_types=col_dtypes, na_strings=na_list)

# show 6th row
print(bank_df[5, :])
# show 6-7 rows
print(bank_df[5:7, :])
# show first 4 columns from 6-7 rows
print(bank_df[5:7, 0:4])
# show job, education and y columns from 6-7 rows
print(bank_df[5:7, ['job', 'education', 'y']])

x = bank_df.names
y = "y"
x.remove(y)  #predictors are all columns except the response
#The first model
#Now let's train a model.
#First, we need to split our dataset into training and testing parts.
#H2O allows doing this with the function split_frame().
train, test = bank_df.split_frame([0.7], seed=42)

from h2o.estimators import H2ORandomForestEstimator

rf = H2ORandomForestEstimator(ntrees=200)
rf.train(x=x, y=y, training_frame=train, validation_frame=test)
predictions = rf.predict(test)

#A lot of interesting and useful information is available here. You can notice two blocks of information:
#the first is reported on the train set and the second on the test set.
#There are different model performance metrics (MSE, RMSE, LogLoss, AUC, Gini, etc.).
#The confusion matrix is a very useful tool for error analysis;
#H2O allows looking at confusion matrices on both the train and test sets.

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.057228614446939205
RMSE: 0.2392250288889923
LogLoss: 0.1832959078306571
Mean Per-Class Error: 0.11324444149577517
AUC: 0.9421770006956457
AUCPR: 0.6371759180029333
Gini: 0.8843540013912914

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3274014892227185: 
               no     yes   Error               Rate
0     no  23749.0  1963.0  0.0763   (1963.0/25712.0)
1    yes    742.0  2481.0  0.2302     (742.0/3223.0)
2  Total  24491.0  4444.0  0.0935   (2705.0/28935.0)

ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.05856774862159308
RMSE: 0.24200774496200134
LogLoss: 0.1840719589577895
Mean Per-Class Error: 0.11586154049350128
AUC: 0.9420647034259153
AUCPR: 0.6551751738303531
Gini: 0.8841294068518306

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2643425350574156: 
               no     yes   Error               Rate
0     no   9818.0  1018.0  0.0939   (1018.0/10836.0)
1    yes    254.0  1163.0  0.1793     (254.0/1417.0)
2  Total  10072.0  2181.0  0.1038   (1272.0/12253.0)

(predictions["predict"] == test["y"]).mean()
##************************************************
#AutoML
#H2O provides the ability to perform automated machine learning.
#The process is very simple and is oriented toward users without much knowledge and experience in machine learning.
#AutoML will iterate through different models and parameters trying to find the best one.
#There are several parameters to specify, but in most cases all you need to do
#is set the maximum runtime in seconds or the maximum number of models.
#You can think about AutoML as something similar to GridSearch,
#but on the level of models rather than on the level of parameters.

from h2o.automl import H2OAutoML
autoML = H2OAutoML(max_runtime_secs=120)
autoML.train(x=x, y=y, training_frame=train)

#We can take a look at the table with all tried models and their corresponding performance
#by checking the .leaderboard attribute of the autoML instance.
#GBM with a 0.94 AUC metric seems to be the best model here.

leaderboard = autoML.leaderboard

#to show the best model:
print(autoML.leader)

predictionAML = autoML.predict(test)
(predictionAML["predict"] == test["y"]).mean()
##*************************************************
##another example using h2o
import h2o
from h2o.automl import H2OAutoML


# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("")
test = h2o.import_file("")

# Identify predictors and response
x = train.columns
y = "response"

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)
##************************************************
#The first algorithm we want to train is a neural network.
#To use this model we need to import H2ODeepLearningEstimator from the h2o.estimators.deeplearning module.
#Then we need to create an instance of this estimator.
#As in the previous example with Random Forest,
#you can pass many different parameters here to control the model and the training process.
#It is important to set up the architecture of the neural network.
#In the parameter hidden we pass a list with the number of neurons in each hidden layer.
#So this parameter controls both the number of hidden layers and the neurons in these layers.
#We set up 3 hidden layers with 100, 10 and 4 neurons respectively.
#Also, we set the activation function to Tanh.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
dl = H2ODeepLearningEstimator(hidden=[100, 10, 4],activation='Tanh')
dl.train(x=x, y=y, training_frame=train, validation_frame=test)
predictions_dl = dl.predict(test)
print((predictions_dl["predict"] == test["y"]).mean())
#We can see that the accuracy is slightly lower than with Random Forest. Maybe we can fine-tune the model's parameters to get better performance.
#In the next few cells we train a linear model. The binomial family means we want to perform classification with logistic regression.
#lambda_search allows searching for the optimal regularization parameter lambda.

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
lm = H2OGeneralizedLinearEstimator(family="binomial", lambda_search=True)
lm.train(x=x, y=y, training_frame=train, validation_frame=test)

predictions_lm = lm.predict(test)
print((predictions_lm["predict"] == test["y"]).mean())

#The last model we want to use here is the Gradient Boosting algorithm. 
#With the default parameters it can give the best results among all the other algorithms.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gb = H2OGradientBoostingEstimator()
gb.train(x=x, y=y, training_frame=train, validation_frame=test)

predictions_gb = gb.predict(test)
print((predictions_gb["predict"] == test["y"]).mean())
#It is worth mentioning the XGBoost integration in the H2O platform.
#XGBoost is one of the most powerful algorithms implementing the gradient boosting idea.
#You can install it standalone, but it is also very convenient to use XGBoost in H2O.
#In the cell below you can see how to create an instance of H2OXGBoostEstimator and how to train it.
#You should understand that XGBoost uses many parameters and can often be very sensitive to changes in these parameters.

from h2o.estimators.xgboost import H2OXGBoostEstimator

param = {
         "ntrees" : 400,
         "max_depth" : 4,
         "learn_rate" : 0.01,
         "sample_rate" : 0.4,
         "col_sample_rate_per_tree" : 0.8,
         "min_rows" : 5,
         "seed": 4241,
         "score_tree_interval": 100
}
xgb = H2OXGBoostEstimator(**param)
xgb.train(x=x, y=y, training_frame=train, validation_frame=test)
predictions_xgb = xgb.predict(test)
print((predictions_xgb["predict"] == test["y"]).mean())
#Cross validation in H2O
#Cross validation is one of the core techniques used in machine learning. The basic idea is to split the dataset into several parts (folds) and then train the model on all except one fold, which is used for testing. Then the next iteration begins: the previous testing fold is included in the training sample, and a different fold from the previous training set is used for testing.
#For example, we split the dataset into 3 folds. On the first iteration we use the 1st and 2nd folds for training and the 3rd for testing. On the second iteration the 1st and 3rd folds are used for training and the 2nd for testing. On the third iteration the 1st fold is used for testing and the 2nd and 3rd for training.
#Cross validation allows estimating the model's performance in a more accurate and reliable way.
#In H2O it is simple to do cross validation. If the model supports it, there is an optional parameter nfolds which can be passed when creating an instance of the model; specify the number of folds with this parameter.
#H2O builds nfolds + 1 models. The additional model is trained on all the available data; this is the main model you get as the result of training.
#Let's train Random Forest and perform cross validation with 3 folds. Note that we are not passing the validation (test) set, but the entire dataset.

rf_cv = H2ORandomForestEstimator(ntrees=200, nfolds=3)
rf_cv.train(x=x, y=y, training_frame=bank_df)
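For comparison, the same nfolds idea can be illustrated with scikit-learn on synthetic data (an analogy, not H2O's API; the dataset below is a made-up stand-in for the bank data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification stand-in for the bank dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# cv=3 mirrors nfolds=3: three models, each scored on its held-out fold
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=3)
print(scores.shape, scores.mean())
```

Each of the three scores comes from a model that never saw its test fold, which is what makes the averaged estimate more reliable than a single train/test split.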
Model tuning using GridSearch

Often you need to try many different parameters and their combinations to find the one that produces the best performance of the model. It is hard and sometimes tedious to do everything by hand. Grid search allows automating this process. All you need to do is specify the set of hyperparameters you want to try and run the GridSearch instance. The system will try all possible combinations of the parameters (training and testing a model for each combination). Let's look at how this tool can be used in H2O.

First, you need to import the GridSearch object:

from h2o.grid.grid_search import H2OGridSearch

Now you need to specify all the parameter values you want to try. We are going to search for an optimal combination of parameters for the XGBoost model we built earlier. The parameters are placed inside a Python dictionary where the keys are the names of the parameters and the values are lists with possible values of these parameters.

xgb_parameters = {'max_depth': [3, 6],
                  'sample_rate': [0.4, 0.7],
                  'col_sample_rate': [0.8, 1.0],
                  'ntrees': [200, 300]}

The next step is the creation of the GridSearch instance. You should pass a model, an id for the grid, and the dictionary with hyperparameters.

xgb_grid_search = H2OGridSearch(model=H2OXGBoostEstimator,
                                grid_id='example_grid',
                                hyper_params=xgb_parameters)

Eventually, you can run the grid search. Note that we set a higher learning rate, because grid search is a very time-consuming process: the number of models to train grows rapidly with the number of hyperparameters. So, taking into account that this is only a learning example, we don't want to test many hyperparameters.

xgb_grid_search.train(x=x, y=y,
                      training_frame=train,
                      validation_frame=test,
                      learn_rate=0.3,
                      seed=42)

We can get the results of the grid search by using the method get_grid() of the GridSearch instance. We want to sort the results by the accuracy metric in descending order.

grid_results = xgb_grid_search.get_grid(sort_by='accuracy', decreasing=True)
print(grid_results)

You can see that the highest accuracy is obtained by the combination of 1.0 column sample rate, 0.4 sample rate, 200 trees and a maximum tree depth of 3.

Friday, October 23, 2020

Python Study Notes: how to load stock data, manipulate it, and find patterns for profit


from pandas_datareader import data as pdr
#run the upgrade if see error: pandas_datareader, ImportError: cannot import name 'urlencode'
#pip install pandas-datareader --upgrade

#import fix_yahoo_finance as yf
#yf.pdr_override() # -- that's all it takes :-)

import matplotlib.pyplot as plt
#from import  candlestick2_ohlc
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import quantile_transform

import numpy as np
#from import DataReader
from pandas_datareader import DataReader
from datetime import timedelta
from stockstats import StockDataFrame
from datetime import datetime
import time
import matplotlib.pyplot as plt
import pandas as pd
import random
#import tabula

from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())
#pysql = lambda q: sqldf(q, locals())

pd.set_option('display.max_rows', None) 
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.width', 150)

import datetime


#here is the code to see the change of VIX
for symb in ('SPY','VXX','^VIX'):
    panel_data = pdr.get_data_yahoo(symb, start=date2, end=date1)
    df0 = pd.DataFrame(panel_data)
    df0['date2'] = df0.index
    df0.columns = ['high','low','open', 'close','volume','adj_close','date2'] 
    df3=df2[((df2.end_x>=df2.end_y-pd.to_timedelta(50,unit='D')) & (df2.end_x

Here is the output you can see from SPY performance after the 1st presidential debate:

Here is the output you can see from VIX performance after the 1st presidential debate:
for symb in ('SPY','^VIX'):
    panel_data = pdr.get_data_yahoo(symb, start=date2, end=date1)
    time.sleep(1.5)
    panel_data.head()
    panel_data.tail()
    panel_data.shape
    list(panel_data)
    df0=pd.DataFrame(panel_data)
    num0=len(df0)
    list(df0.columns)
    df0['date2'] = df0.index
    df0.columns = ['high','low','open', 'close','volume','adj_close','date2']
    ##################################################
    #file=r'C:\Users\skuang\Google_Drive\Python\data\LABU.csv'
    #test = pd.read_csv(file)
    #test.shape
    #test.tail()
    #test.columns=['date2','open','high','low','close','adj_close','volume']
    #test['symbol']='LABU'
    #test.dtypes
    #df0=test[['date2','open','low','high','close','volume','symbol']][test['symbol']=='LABU']
    #df0.dtypes
    import datetime
    df0['symb']=symb
    df0['weekday']=pd.to_datetime(df0['date2']).dt.weekday+1
    df0['year']=pd.to_datetime(df0['date2']).dt.year
    df0['L9'] = df0['low'].rolling(window=9).min()
    df0['H9'] = df0['high'].rolling(window=9).max()
    df0['RSV'] = 100*((df0['close'] - df0['L9']) / (df0['H9'] - df0['L9']) )
    df0['k'] = df0['RSV'].rolling(window=3).mean()
    df0['d'] = df0['k'].rolling(window=3).mean()
    #df0['close10'] = df0['close'].rolling(window=8).mean()
    df0['close22'] = df0['close'].rolling(window=22).mean()
    df0['close60'] = df0['close'].rolling(window=60).mean()
    df0['close250'] = df0['close'].rolling(window=250).mean()
    df0['change_22']=df0.close/df0.close22-1
    df0['change_60']=df0.close/df0.close60-1
    df0['change_250']=df0.close/df0.close250-1
    # Converting date to pandas datetime format
    #df0['Date'] = pd.to_datetime(df['Date'])
    # Getting week number
    df0['week'] = df0['date2'].dt.week
    df0['gap_250']=round(df0.change_250/0.05,0)*0.05
    df0['daily_change']=df0.close/df0.close.shift(1)-1
    df0['daily_change1']=round(df0['daily_change'],2)
    df0['ret_mon_1']=df0.close.shift(-22)/df0.close-1
    df0['ret_mon_2']=df0.close.shift(-45)/df0.close-1
    df0['ret_mon_3']=df0.close.shift(-66)/df0.close-1
    df0['ret_mon_6']=df0.close.shift(-130)/df0.close-1
    df0['ret_mon_9']=df0.close.shift(-200)/df0.close-1
    df0['ret_mon_12']=df0.close.shift(-255)/df0.close-1
    df0['ret_mon_24']=df0.close.shift(-560)/df0.close-1
    df0['ret_mon_36']=df0.close.shift(-765)/df0.close-1
    df0['ret_mon_48']=df0.close.shift(-1020)/df0.close-1
    df0['ret_mon_60']=df0.close.shift(-1270)/df0.close-1
    df0[['date2','daily_change','daily_change1','close']].tail()
    df0.columns
    #check the return distribution after 22/45/66/130,250,500,750,1000,1250 business days
    r0=df0.copy()
    r0[(r0.weekday==5)].shape
    ret_mean=r0[(r0.weekday==5)].groupby(['symb','gap_250'],as_index=False).agg({
        "week": "count",
        "ret_mon_3": [("avg",'mean')], "ret_mon_6": [("avg",'mean')],
        "ret_mon_12": [("avg",'mean')], "ret_mon_24": [("avg",'mean')],
        "ret_mon_36": [("avg",'mean')], "ret_mon_48": [("avg",'mean')],
        "ret_mon_60": [("avg",'mean')] })
    ret_mean.columns = ['_'.join(col) for col in ret_mean.columns]
    list(ret_mean.columns)
    ret_mean
    z0=z0.append(ret_mean)
    curr0=curr0.append(df0[( (df0.year>=2020) & (df0.weekday==5) )][['symb','date2','year','close', 'close250', 'change_22', 'change_60', 'change_250','gap_250']].tail(1))
    df1 = StockDataFrame.retype(df0)
    macd=pd.DataFrame(df1.get('macd'))
    macd.head()
    daily0=daily0.append(df1[['symb','date2','week', 'weekday','year','open','low','high','close','daily_change1','k','d','macd','macds', 'macdh' ]])
    # daily0=daily0.append(df0[ (df0.weekday==5) ][['symb','date2','week', 'weekday','year','close', 'close250', 'change_22', 'change_60', 'change_250','gap_250']])
    #t1=df0[(df0.weekday==5) & (df0.week % 2 == 0 ) ][['date2','symb','close', 'week', 'weekday', 'year', 'close60', 'close250', 'change_250']]
    #t1['upper250']=np.where(t1.change_250< -0.099, -t1.change_250*10,1)
    t1=df0[['symb','date2','week', 'weekday','year','close', 'gap_250','daily_change','daily_change1']]
    t1['flag']=np.where(t1['daily_change']<-0.01, 1,0)
    t1['input']=np.where(t1['daily_change']<-0.01,abs(t1['daily_change1'])*50000,0)
    t1.tail(10)
    t1a=t1.groupby(by=['year'],as_index=False).agg({"input":"sum"}).sort_values('year')
    t1a
    #t1['input']=1000*t1['upper250']
    # t1['input']=1000
    t1['share']=t1.input/t1.close
    #rollnum=26*5
    rollnum=250*15
    t1['shares'] = t1['share'].rolling(min_periods=1,window=rollnum).sum()
    t1['cost'] = t1['input'].rolling(min_periods=1,window=rollnum).sum()
    t1['shares_year_early']=t1['shares'].shift(250)
    t1['value_now']=t1['shares'] *t1.close
    t1['value_change']=t1['value_now']-t1.cost
    t1['pcnt_gain']=t1['value_change']/t1['cost']
    t1['cummu_return_per_year']=np.power(t1['pcnt_gain']+1,1/15)-1
    t1['value_b4']=(t1['value_now'].shift(250))
    t1['gl_peryear']=(t1['value_now']-t1['value_now'].shift(250)).fillna(1).astype(int).apply(lambda x: f'{x:,}')
    t1['value_now']=t1['value_now'].fillna(1).astype(int).apply(lambda x: f'{x:,}')
    t1['value_b4']=t1['value_b4'].fillna(1).astype(int).apply(lambda x: f'{x:,}')
    t1['shares']=t1['shares'].fillna(1).astype(int).apply(lambda x: f'{x:,}')
    t1['value_change']=t1['value_change'].fillna(1).astype(int).apply(lambda x: f'{x:,}')
    t1.tail()
    t1[(t1.week==2) & (t1.weekday==5)].tail(30)[['symb','year','shares','cost', 'value_now','value_change','pcnt_gain','cummu_return_per_year']]
    #t1[(t1.week==2)].tail(17).pcnt_gain.describe()
    t1[(t1.week==14)].tail(30)[['symb','year','shares','cost', 'value_now','value_change','pcnt_gain','value_b4', 'gl_peryear']]
    summ_year=summ_year.append(t1[(t1.week==2) & (t1.cost>50000)].tail(30)[['symb','year','shares','cost', 'value_now','value_change','pcnt_gain','value_b4', 'gl_peryear']])
    print('size for symbol is ',symb, ' : ',ret_mean.shape)
z0.head()
z0.tail()
daily0.shape
#================================================

Tuesday, September 22, 2020

Data Science Study Notes: Automatic Machine Learning (AutoML)

Automated Machine Learning (AutoML) is one of the hottest topics in data science today, but what does it mean?

Why is random search much faster than grid search, while still not missing the optimal region? Here is the key formula we need: if the optimal region covers the top 5% of the search space, the probability that at least one of n random trials lands in it is 1 - (1 - 0.05)^n.
Where is the formula coming from? Each independent random trial misses the top-5% region with probability 0.95, so all n trials miss with probability 0.95^n.
So you can apply the typical condition of a 95% chance of falling in the optimal region: solving 1 - 0.95^n >= 0.95 gives n >= 59, so as long as we randomly sample about 60 times, we have a 95% chance of landing in the optimal region.
Here is the graph to understand why random search is likely to reach the optimal region in far fewer trials: it is essentially due to the fact that there are usually not that many important hyperparameters in a model. In other words, only a few important factors are worth searching finely; grid search wastes most of its trials re-testing values of the unimportant factors.
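A quick numeric check of the 60-samples/95% claim (assuming the optimal region is the top 5% of the search space):

```python
import math

def p_hit(n, top=0.05):
    """Probability that at least one of n random trials lands in the top region."""
    return 1 - (1 - top) ** n

print(round(p_hit(60), 4))  # ~0.9539: 60 random samples give a >95% chance

# Smallest n with at least a 95% chance of hitting the top-5% region
n_needed = math.ceil(math.log(0.05) / math.log(0.95))
print(n_needed)  # 59
```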
Here is the video from Danny Leybzon. He has an academic background in computational statistics and is a grandmaster in AutoML.

Friday, September 18, 2020

Data Science Study Notes: How do we detect outliers (anomaly records) in python?

Question: How do we get outliers from python? How do we rank column by certain group?
In this blog post we will cover the major differences between the approaches to detecting outliers (anomaly records) in Python, and which approach is best. There are quite a few ways to do outlier detection, including the following:

1. The IQR method; distance-based approaches (Mahalanobis distance, one-class SVM, the IsolationForest algorithm in Python, etc.); and density-based approaches (Local Outlier Factor). These approaches are based on spatial proximity; in other words, they measure how far an observation is from the other cluster of observations, using either one dimension or multiple dimensions.

Mahalanobis distance is a distance-based approach. It is the distance between a point and a distribution, not between two distinct points; it is effectively a multivariate equivalent of the Euclidean distance.
It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical applications ever since. However, it's not so well known or used in machine learning practice.

Something special about Mahalanobis distance: if two columns are perfectly correlated, the covariance matrix is singular and you may have difficulties calculating its inverse. In that case, you may want to remove one of the perfectly correlated columns.
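Here is a minimal numpy sketch of a mahalanobis() helper of the kind used in the snippet below (assuming numeric arrays and a non-singular covariance matrix; the data are made up):

```python
import numpy as np

def mahalanobis(x, data):
    """Mahalanobis distance of each row of x from the distribution of data."""
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data.T))  # fails if columns are perfectly correlated
    diff = x - mu
    # sqrt of the quadratic form diff^T * Sigma^-1 * diff, row by row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
# Two correlated (but not perfectly correlated) columns
data = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
# A point near the center vs. a point that fights the correlation
points = np.array([[0.0, 0.0], [3.0, -3.0]])
d = mahalanobis(points, data)
print(d)  # the second point is far from the distribution
```

Note that (3, -3) is only ~4.2 Euclidean units from the mean, but its Mahalanobis distance is much larger because it runs against the positive correlation of the two columns.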

Here are some sample code in python:
df_x = y0[[  'var1', 'var2',   'var3' ]].copy()
#collect columns that are entirely missing or have undefined variance
list_2remove = []      
for list1 in list_col0:
    if ( (df_x1a[list1].isnull().sum()==df_x1a.shape[0]) | (np.isnan(df_x1.loc[:,list1].var()) )  ):
        list_2remove.append(list1)

print('list to remove:  ', list_2remove)     
df_x1.drop(columns=list_2remove, inplace=True)
df_x1.fillna(0.5, inplace=True)

#collect one column from every perfectly correlated pair
list_2remove = []
for i in range(len(list_col1)):
    for j in range(i+1,len(list_col1)):
        if  abs(abs(np.corrcoef(df_x1[list(list_col1)[i]], df_x1[list(list_col1)[j]])[0,1]) -1)<0.0001 :
            list_2remove.append(list(list_col1)[j])
print('strong correlate list to remove:  ', list_2remove)            
df_x1.drop(columns=list_2remove, inplace=True)
if df_x1.shape[1]>1:
    df_x1['mahala'] = mahalanobis(x=df_x1, data=df_x1)
    y1=pd.merge(y0,df_x1,left_index=True, right_on='id',how='left')

2. Deviation-based model approaches (Cook's distance, studentized residuals): you need to fit a statistical model first, then detect the outliers relative to it.

#generate the percentile for all the values in the column:
df['pct_rank'] = df['Number_legs'].rank(pct=True)
#output will be a decimal between 0.00 and 1 (multiply by 100 for a percent).
#use method='min' to take the minimal rank for tied values.
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
#treat the NaN values as the bottom of the rank (the largest).

df["rank"] = df.groupby("group_ID")["value"].rank("dense", ascending=False) 
>>> df
     group_ID   item_ID  value  rank
0  0S00A1HZEy      AB     10     2
1  0S00A1HZEy      AY      4     3
2  0S00A1HZEy      AC     35     1 

#we can also use the qcut function in python to rank columns, for example:
data1['ranks'] = data1.groupby('group_var')['column_var'].transform(pd.qcut,3, labels=False,duplicates='drop')
#3 means 3 bins, giving ranks of 0, 1, 2.

#remove any value above a certain percentile
#numpy.percentile(df.a, 95) is much faster than df.a.quantile(.95)
t2b= t2b[(t2b["ratio_list"] < np.percentile(t2b["ratio_list"],97.5))] 
The main difference between these two families of approaches: the IQR method and the density-/distance-based approaches catch outliers that are essentially extreme values in one or more dimensions. Meanwhile, the model approaches (Cook's distance, studentized residuals) focus on the distance to the fitted curve instead of the distance to the raw data cluster itself.

For instance, a 25-year-old vehicle with 300,000 miles on it gets a very low value, say $500. If we check all the dimensions, the mileage, age, and value are all at the extremes, some at the high end (age/mileage), some at the low end (value), so that vehicle will easily be caught as an outlier by the distance approach. However, by intuition an old vehicle like that deserves its low $500 value; in other words, it is most likely a reasonable case, not an outlier.

Now you can easily see the pros and cons of the two approaches. The first, the relatively simple dimensional-distance approach, is more generic for catching outliers and is not tied to a specific model. The second, the model-based approach, is more powerful when we are removing outliers to fit a particular model. In other words, if you are trying to clean the data before fitting your model, prefer the second, model-based statistical approach.

Some intuitive review for the simple linear regression:
1. What's the goal/target when we are trying to do regression?
Answer: to minimize the squared error via the "least squares" method; mathematically, set the first derivative to 0 and solve for the coefficients.
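A quick numeric sketch of the closed-form least-squares solution (toy data lying exactly on y = 2x + 1):

```python
import numpy as np

# toy data lying exactly on y = 2x + 1
x = np.array([0., 1., 2., 3., 4.])
y = 2 * x + 1

# closed-form least-squares estimates: b = S_xy / S_xx, a = y_bar - b * x_bar
x_bar, y_bar = x.mean(), y.mean()
b = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
a = y_bar - b * x_bar  # recovers intercept 1.0 and slope 2.0
```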

2. How do we know if it's linear regression?

Answer: start with a visualization plot, which usually gives some confidence in the linear relationship. Then check the residual plot: there shouldn't be any warning pattern in it, for example the residuals getting bigger and bigger as x goes up or down, or the residual plot showing a particular shape (round, circle, etc.).

3. How do we quantify the results of linear regression?

Answer: use the coefficient of determination, r^2 = SSR/SST, where SST is the total sum of squares, SSR the sum of squares from the regression, and SSE the sum of squares from the error; it measures the variance explained by the regression model. When we add more predictors, R^2 will mechanically get bigger, so we use another metric, adjusted R^2, to see the real impact; it penalizes the number of extra parameters. There are other ways to penalize extra parameters against potential overfitting, namely Ridge (L2) and Lasso (L1) regression.
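The R^2 bookkeeping can be sketched directly from the definitions (the observed and fitted values below are made up for illustration):

```python
import numpy as np

y = np.array([3., 5., 7., 10., 11.])          # observed values
y_hat = np.array([3.2, 4.8, 7.1, 9.7, 11.2])  # fitted values from some model
n, p = len(y), 1                              # n observations, p predictors

sst = ((y - y.mean()) ** 2).sum()  # total sum of squares
sse = ((y - y_hat) ** 2).sum()     # error sum of squares
ssr = sst - sse                    # regression sum of squares
r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors
```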

4. How do we know the linear parameters are significantly different from 0?

Answer: the standard error of the estimate is S_e^2 = SSE/(n-2), and (n-2)S_e^2/sigma^2 follows a chi-square distribution with df = n-2. Therefore the slope statistic (b-0)/(S_e/sqrt(S_xx)) follows a t-distribution with df = n-2.
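The slope test can be computed by hand for a toy data set (values below are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2])
n = len(x)

sxx = ((x - x.mean()) ** 2).sum()
b = ((x - x.mean()) * (y - y.mean())).sum() / sxx
a = y.mean() - b * x.mean()

sse = ((y - (a + b * x)) ** 2).sum()
se2 = sse / (n - 2)                   # S_e^2 = SSE / (n - 2)
t = (b - 0) / np.sqrt(se2 / sxx)      # t-statistic with df = n - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
```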

5. What's one-way ANOVA?

Answer: use an F-test to check whether there are significant differences across multiple groups. The F statistic compares two variance estimates: the within-group variance and the between-group variance.
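A one-way ANOVA sketch with scipy (the group measurements are made up):

```python
from scipy import stats

# three hypothetical groups; the second clearly differs from the others
g1 = [6.1, 5.8, 6.3, 6.0]
g2 = [7.2, 7.5, 7.1, 7.4]
g3 = [6.0, 6.2, 5.9, 6.1]

# F = between-group variance / within-group variance
f_stat, p_val = stats.f_oneway(g1, g2, g3)
```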

Here is an example using the IsolationForest algorithm in Python:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
clf = IsolationForest(n_estimators=100, random_state=rng)
clf.fit(df_train)
outlier = clf.predict(df_train)  # -1 for outliers, 1 for normal
df_train1['outlier'] = outlier
here is a sample code to define the function to catch outliers:
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

def catch_outlier(Some_ID):
    outlier1 = input_data[input_data.ID == Some_ID]
    ols = smf.ols('dep_var ~ var1_num + var2_num + C(var_category, Treatment)',
                  data=outlier1).fit()
#    print(ols.summary())
#    ols.params
#    ols.rsquared
    output1 = OLSInfluence(ols)
    outlier1['student_resid'] = output1.resid_studentized_external
    outlier1['cooks_d'] = output1.cooks_distance[0]

    # calculate outlier flags based on studentR and cooksD cutoffs
    outlier1['studentr_flag'] = outlier1['student_resid'] >= 3
    outlier1['cooksd_flag'] = outlier1['cooks_d'] > 4/outlier1.shape[0]
    outlier1['outlier'] = outlier1.studentr_flag & outlier1.cooksd_flag
    outlier2 = outlier1[outlier1['outlier'] == 1]
    outlier3 = outlier2[['outlier', 'transaction_id', 'ID']]

    return outlier3
Question: what is the difference between the internal studentized residual and the external studentized residual?
1. A studentized residual is the standardization of (y - y_hat); the denominator is sigma*sqrt(1-h_ii), where h_ii is the ith diagonal element of the hat matrix (the orthogonal projection onto the column space).

2. For the internal studentized residual, the sigma comes from: sum of the (y-y_hat)^2/(n-m) for all the observations. 

3. For the external studentized residual, the sigma comes from: sum of the (y-y_hat)^2/(n-m-1) for all the observations except ith observation itself. 

4. If the i th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the i th observation from the process of estimating the variance when one is considering whether the i th case may be an outlier, which is for the external studentized residual.

Here is a very useful tutorial about how to get the outlier statistics indicators.
Notice there is a significant difference in running time if you don't choose these carefully.
##==using summary_frame will take much longer than OLSInfluence ===
oi = ols.get_influence()
#==summary_frame will generate a few more other unused stats
#==Cook's distance and DFFITS are essentially the same
##==prefer using OLSInfluence, it's much quicker
output1 = OLSInfluence(ols)
#==Another comparison: calling get_resid_studentized_external() is much quicker
outlier1['student_resid'] = output1.get_resid_studentized_external()
#student_resid = output1.resid_studentized_external  ##==takes much longer
outlier1['cooks_d'] = output1.cooks_distance[0]
  • IsolationForest approach, which is similar to the random forest methodology.
The main idea, which is different from other popular outlier detection methods, is that Isolation Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any tree ensemble method, is built on the basis of decision trees. In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum value of the selected feature.

In principle, outliers are less frequent than regular observations and are different from them in terms of values (they lie further away from the regular observations in the feature space).That is why by using such random partitioning they should be identified closer to the root of the tree (shorter average path length, i.e., the number of edges an observation must pass in the tree going from the root to the terminal node), with fewer splits necessary.

Here is some sample code in Python:
import numpy as np
from sklearn.ensemble import IsolationForest

# training the model
rng = np.random.RandomState(42)
clf = IsolationForest(n_estimators=100, max_samples='auto',
                      max_features=1.0, bootstrap=False,
                      n_jobs=None, random_state=rng, verbose=0)
clf.fit(df_x1)

# predictions
y_pred = clf.predict(df_x1)
#output -1 for outliers, 1 for normal

#generate the score for each record:
#using either score_samples() or decision_function()

The idea of identifying a normal vs. abnormal observation can be observed in the following graph. A normal point (on the left) requires more partitions to be identified than an abnormal point (right).
Isolation Forest

If we need to build a regression model to predict our target variable, then the best approach is definitely the deviation-based model approach, for exactly the reason in the 25-year-old, 300,000-mile vehicle example above: the distance approach flags a perfectly reasonable $500 valuation as an outlier, while the model-based residual correctly keeps it.

If we have to choose between the Mahalanobis distance and the IsolationForest approach, my previous working experience says Mahalanobis distance is preferred over IsolationForest: it is more explainable.
Benford's law application in outlier detection
Benford's law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small.

For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. Here are the exact distribution table and graph.
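That distribution comes from the formula P(d) = log10(1 + 1/d); a one-liner reproduces the table:

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
# digit 1: ~30.1%, digit 9: ~4.6%, and the nine probabilities sum to 1
```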
In 1995, the phenomenon was finally proven by Hill. His proof was based on the fact that numbers in data series following the Benford’s Law are, in effect, “second generation” distributions, i.e. combinations of other distributions.

Benford's law tends to apply most accurately to data that span several orders of magnitude. As a rule of thumb, the more orders of magnitude that the data evenly covers, the more accurately Benford's law applies. For instance, one can expect that Benford's law would apply to a list of numbers representing the populations of UK settlements. But if a "settlement" is defined as a village with population between 300 and 999, then Benford's law will not apply.

In the United States, evidence based on Benford's law has been admitted in criminal cases at the federal, state, and local levels.

Benford's law has been invoked as evidence of fraud in the 2009 Iranian elections, and has also been used to analyze other election results. However, other experts consider Benford's law essentially useless as a statistical indicator of election fraud in general.

It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, etc.

What is Cointegration?
Definition: If there exists a stationary linear combination of nonstationary random variables, the variables combined are said to be cointegrated.

Stationary: a stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends or with seasonality are not stationary. On the other hand, a white-noise series is stationary.

To check stationarity, look at Autocorrelation Function (ACF) plots, or use a unit root test (the Augmented Dickey-Fuller test) and check the p-value.

Example of cointegration (and the lack of it): an old woman and a boy are unrelated to one another, except that they are both on a random walk in the park. Information about the boy's location tells us nothing about the old woman's location; their paths are not cointegrated.

The old man and the dog are joined by one of those leashes that has the cord rolled up inside the handle on a spring. Individually, the dog and the man are each on a random walk. They cannot wander too far from one another because of the leash. We say that the random processes describing their paths are cointegrated.
