Friday, September 18, 2020

Data Science Study Notes: How do we detect outliers (anomaly records) in python?

Question: How do we detect outliers in Python? How do we rank a column within a certain group?
In this blog post we will cover the major differences between several approaches to detecting outliers (anomaly records) in Python, and which approach works best. There are quite a few ways to do outlier detection, including the following:

1. The IQR method and distance-based approaches (Mahalanobis distance, one-class SVM, the IsolationForest algorithm in Python, etc.), plus density-based approaches (Local Outlier Factor). These approaches are all based on spatial proximity; in other words, they measure how far an observation lies from the cluster of other observations, using either one dimension or multiple dimensions. (A short IQR sketch follows right after this paragraph.)
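
As a quick illustration of the IQR method, here is a minimal sketch; the DataFrame df and the 'value' column are hypothetical placeholders, and k=1.5 is the usual rule-of-thumb multiplier:

import numpy as np
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    # boolean mask flagging values outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# usage (hypothetical column name):
# df['is_outlier'] = iqr_outliers(df['value'])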

Mahalanobis distance is a distance-based approach: it measures the distance between a point and a distribution, rather than between two distinct points, and is effectively a multivariate generalization of the Euclidean distance. It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical applications ever since, though it is not so well known or used in machine learning practice.

One thing to watch out for with Mahalanobis distance: if two columns are perfectly correlated, the covariance matrix is singular and cannot be inverted, so you may need to remove one of the perfectly correlated columns first.

Here is some sample code in Python:
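Note that the snippet calls a mahalanobis() helper that is not defined in the original post; here is a minimal sketch of what such a helper could look like (it returns the squared Mahalanobis distance of each row of x relative to the distribution of data):

import numpy as np
import pandas as pd

def mahalanobis(x: pd.DataFrame, data: pd.DataFrame) -> np.ndarray:
    # squared Mahalanobis distance of each row of x to the distribution of data
    x_mu = x - data.mean()
    inv_cov = np.linalg.inv(np.cov(data.values.T))   # fails if columns are perfectly correlated
    left = np.dot(x_mu, inv_cov)
    # row-wise dot product = diagonal of x_mu @ inv_cov @ x_mu.T
    return np.einsum('ij,ij->i', left, x_mu.values)
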
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# standardize the selected columns (y0 is the source DataFrame)
df_x = y0[['var1', 'var2', 'var3']].copy()
df_x.shape
df_x.index
scaler = StandardScaler()
scaler.fit(df_x)
df_x1 = pd.DataFrame(scaler.transform(df_x))
df_x1.columns = df_x.columns
df_x1.shape
df_x1a = df_x1.corr()
list_col0 = list(df_x1.columns)
  
# drop columns whose correlations are all NaN or whose variance is NaN (e.g., constant columns)
list_2remove = []
for list1 in list_col0:
    if (df_x1a[list1].isnull().sum() == df_x1a.shape[0]) | (np.isnan(df_x1.loc[:, list1].var())):
        list_2remove.extend([list1])

print('list to remove:  ', list_2remove)
list_col1=list(set(list_col0)-set(list_2remove))
df_x1.shape
df_x1=df_x1[list_col1]
df_x1.fillna(0.5, inplace=True)

# drop one column from any pair that is (almost) perfectly correlated,
# otherwise the covariance matrix cannot be inverted
list_2remove = []
for i in range(len(list_col1)):
    for j in range(i + 1, len(list_col1)):
        if abs(abs(np.corrcoef(df_x1[list_col1[i]], df_x1[list_col1[j]])[0, 1]) - 1) < 0.0001:
            list_2remove.extend([list_col1[j]])

print('strong correlate list to remove:  ', list_2remove)
list_col2 = list(set(list_col1) - set(list_2remove))
df_x1.shape
df_x1=df_x1[list_col2]
  
if df_x1.shape[1] > 1:
    # squared Mahalanobis distance of each row to the overall distribution (see the helper sketched above)
    df_x1['mahala'] = mahalanobis(x=df_x1, data=df_x1)
    df_x1['id']=df_x.index
    df_x1.head()
    df_x1.index
  
    y1=pd.merge(y0,df_x1,left_index=True, right_on='id',how='left')
    y0.shape
    list(y1.columns)
    y1.sort_values(by=['mahala'],ascending=False,inplace=True)
    list_col3=['puid','mahala','year5']+list(list_col2)
    y2=y1[list_col3]
    y2.reset_index(drop=True,inplace=True)
    y2['rank']=y2.index+1
    y2['pcnt']=1-(y2.index+1)/y2.shape[0]
    y2.head(10)


2. Deviation-based model approaches (Cook's distance, studentized residuals): you need to build a statistical model first, and then detect outliers from the model's diagnostics.

# generate the percentile rank for all the values in a column:
df['pct_rank'] = df['Number_legs'].rank(pct=True)
# with pct=True the output is a decimal between 0 and 1 (i.e., a percentile)
df['percent'] = df['mahn_dist'].rank(ascending=True, method='min', pct=True)
# use method='min' to assign the minimal percentile to tied values
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
# treat NaN values as the bottom of the rank (i.e., the largest rank)

df["rank"] = df.groupby("group_ID")["value"].rank("dense", ascending=False) 
>>> df
     group_ID   item_ID  value  rank
0  0S00A1HZEy      AB     10     2
1  0S00A1HZEy      AY      4     3
2  0S00A1HZEy      AC     35     1 

# we can also use the qcut function to rank columns within groups, for example:
data1['ranks'] = data1.groupby('group_var')['column_var'].transform(pd.qcut, 3, labels=False, duplicates='drop')
# q=3 produces ranks 0, 1, 2 (terciles)

# remove any value above a certain percentile;
# note that numpy.percentile(df.a, 95) is typically much faster than df.a.quantile(.95)
t2b = t2b[t2b["ratio_list"] < np.percentile(t2b["ratio_list"], 97.5)]
The main difference between these two families of approaches: the IQR method and density-/distance-based approaches catch outliers that are essentially extreme values on one or more dimensions, while the model-based approach (Cook's distance, studentized residuals) focuses on the distance to the fitted curve rather than the distance to the raw data cluster itself.

For instance, a 25-year-old vehicle with 300,000 miles on it may be worth very little, say $500. On every dimension (mileage, age, value) it sits at an extreme end, high on age and mileage and low on value, so the distance-based approach will easily flag that vehicle as an outlier. However, based on our intuitive understanding, an old vehicle like that deserves its low $500 value; in other words, it is most likely a reasonable case, not an outlier.

Now you can easily see the pros and cons of the two approaches. The first, a relatively simple dimensional-distance approach, is more generic for catching outliers and is not tied to a specific model. The second, the model-based approach, is more powerful when we are trying to remove outliers before fitting a particular model. In other words, if you are cleaning the data before fitting your model, the model-based statistical approach is preferred.

Some intuitive review of simple linear regression:
1. What's the goal/target when we are trying to do regression?
Answer: to minimize the squared error via the least-squares method; mathematically, take the first derivative of the squared error with respect to each coefficient, set it to 0, and solve for the coefficients.
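
For simple linear regression y = b0 + b1*x, setting those derivatives to zero gives the familiar closed-form estimates; a minimal sketch with made-up data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed-form least-squares estimates: b1 = S_xy / S_xx, b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
b0 = y_bar - b1 * x_bar
print(b0, b1)

# same result from np.polyfit(x, y, 1), which returns [slope, intercept]
print(np.polyfit(x, y, 1))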

2. How do we know whether linear regression is appropriate?

Answer: start with a visualization plot first; it usually gives some confidence that a linear fit is reasonable. Then check the residual plot: there shouldn't be any obvious pattern, for example residuals that grow as x increases or decreases (a funnel shape), or a residual plot with a particular shape (curved, circular, etc.).
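
A quick way to eyeball the residual plot, a sketch using the same made-up data as above:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)        # slope, intercept

fitted = b0 + b1 * x
residuals = y - fitted

# residuals vs. fitted values: we want a structureless band around zero,
# no funnel shape (growing variance) and no curved pattern
plt.scatter(fitted, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('fitted values')
plt.ylabel('residuals')
plt.show()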

3. How do we quantify the results of linear regression?

Answer: use the coefficient of determination, r^2 = SSR/SST, where SST is the total sum of squares, SSR is the sum of squares from the regression, and SSE is the sum of squares from error; r^2 is the proportion of variance explained by the regression model. When we add more predictors, R^2 can only get bigger, so we use another metric, adjusted R^2, to see the real impact, since it penalizes the number of extra parameters. There are other ways to penalize extra parameters against potential overfitting, namely Ridge (L2 penalty) and Lasso (L1 penalty) regression.
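
Continuing the same toy example, R^2 and adjusted R^2 can be computed directly from the sums of squares (a sketch; p is the number of predictors):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x

sse = ((y - fitted) ** 2).sum()       # SSE: sum of squares from error
sst = ((y - y.mean()) ** 2).sum()     # SST: total sum of squares
ssr = sst - sse                       # SSR: sum of squares from regression
r2 = ssr / sst                        # equivalently 1 - SSE/SST

n, p = len(y), 1                      # n observations, p predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)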

4. How do we know whether the linear parameters are significantly different from 0?

Answer: with the standard error of the estimate S_e^2 = SSE/(n-2), the quantity SSE/sigma^2 follows a chi-square distribution with df = n-2. Therefore the slope statistic (b - 0)/(S_e / sqrt(S_xx)) follows a t-distribution with df = n-2.
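
The same toy example again, computing the slope's t-statistic by hand and cross-checking it against statsmodels (a sketch):

import numpy as np
import statsmodels.api as sm
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)

n = len(x)
sse = ((y - (b0 + b1 * x)) ** 2).sum()
s_e2 = sse / (n - 2)                        # estimate of the error variance
s_xx = ((x - x.mean()) ** 2).sum()
t_stat = (b1 - 0) / np.sqrt(s_e2 / s_xx)    # ~ t-distribution with df = n - 2
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))
print(t_stat, p_value)

# cross-check with statsmodels OLS (index 1 is the slope)
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(ols.tvalues[1], ols.pvalues[1])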

5. What's one-way ANOVA?

Answer: use an F-test to check whether there are significant differences among multiple groups. The F statistic compares two variance estimates: the within-group variance and the between-group (across-group) variance.
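
A minimal one-way ANOVA sketch using scipy; the three groups below are made-up data:

from scipy import stats

group_a = [23, 25, 21, 24, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [22, 24, 23, 25, 21]

# F = between-group variance / within-group variance
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)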

Here is an example using the IsolationForest algorithm in Python.
#===========================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
clf = IsolationForest(max_samples=100, contamination=0.0015,
                      random_state=rng)
clf.fit(df_train)
outlier = clf.predict(df_train)       # -1 flags outliers, 1 flags normal records
df_train1 = df_train.copy()           # keep a copy for the outlier flag
df_train1['outlier'] = outlier
df_train1.outlier.describe()
#===========================================================
Here is sample code defining a function to catch outliers:
#===========================================================
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

# assumes input_data (the full dataset) and outlier_all (a list collecting results) are defined globally
def catch_outlier(Some_ID):
    outlier1 = input_data[input_data.ID == Some_ID]
    outlier1.shape
    ols = smf.ols('dep_var ~ var1_num + var2_num + C(var_category, Treatment)',
                  data=outlier1).fit()
#    print(ols.summary())
#    ols.params
#    ols.rsquared
#    start = time.time()
    output1 = OLSInfluence(ols)
    student_resid = output1.get_resid_studentized_external()
    outlier1['student_resid'] = student_resid
    cooks_d = output1.cooks_distance
    outlier1['cooks_d'] = cooks_d[0]
#    end = time.time()
#    print(round((end - start)/60,2), "mins" )

    # calculate outlier flags based on the studentized-residual and Cook's distance cutoffs;
    # use the absolute value so large negative residuals are also flagged
    outlier1['studentr_flag'] = outlier1['student_resid'].abs() >= 3
    outlier1['cooksd_flag'] = outlier1['cooks_d'] > 4 / outlier1.shape[0]
    outlier1['outlier'] = outlier1.studentr_flag & outlier1.cooksd_flag
    outlier2 = outlier1[outlier1['outlier'] == 1]
    outlier3 = outlier2[['outlier', 'transaction_id', 'ID']]

    outlier_all.append(outlier3)
    return outlier_all
#===========================================================
Question: what is the difference between the internal studentized residual and the external studentized residual?
1. A studentized residual is the standardization of (y - y_hat); the denominator is sigma*sqrt(1 - h_ii), where h_ii is the i-th diagonal element of the hat matrix (the orthogonal projection onto the column space of X).

2. For the internal studentized residual, sigma is estimated from the sum of (y - y_hat)^2 / (n - m) over all observations (m is the number of parameters).

3. For the external studentized residual, sigma is estimated from the sum of (y - y_hat)^2 / (n - m - 1) over all observations except the i-th observation itself.

4. If the i-th case is suspected of being improbably large, then it would also not be normally distributed. Hence it is prudent to exclude the i-th observation from the variance estimate when considering whether the i-th case may be an outlier, which is exactly what the external studentized residual does.
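
To see the difference in practice, both versions are available from OLSInfluence; a minimal self-contained sketch with simulated data and one planted outlier:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.RandomState(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
y[0] += 8                                           # plant one obvious outlier

ols = sm.OLS(y, sm.add_constant(x)).fit()
influence = OLSInfluence(ols)
internal = influence.resid_studentized_internal     # sigma estimated from all observations
external = influence.resid_studentized_external     # sigma estimated leaving out the i-th observation
print(internal[0], external[0])                     # the external version is more extreme for the outlier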

Here is a very useful tip about how to get the outlier statistics indicators. Notice there are significant differences in running time if you don't choose carefully.
#===========================================================
##==using summary_frame will take much longer than  OLSInfluence ===
oi = ols.get_influence()
Student_red_cook=oi.summary_frame()[["student_resid","cooks_d"]]
#==summary_frame will generate a few more other unused stats  
#==Cook distance and DFFITS are essentially the same 
#===========================================================
##== prefer using OLSInfluence, it's much quicker
output1 = OLSInfluence(ols)
student_resid = output1.get_resid_studentized_external()
outlier1['student_resid'] = student_resid
cooks_d = output1.cooks_distance
outlier1['cooks_d'] = cooks_d[0]
#===========================================================
#== Another comparison: using the get_ method is much quicker
student_resid = output1.get_resid_studentized_external()
# student_resid = output1.resid_studentized_external  ##== takes much longer
#===========================================================
  • The IsolationForest approach, which is similar to the random forest methodology.
The main idea, which differs from other popular outlier detection methods, is that Isolation Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any tree ensemble method, is built on decision trees. In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum value of the selected feature.

In principle, outliers are less frequent than regular observations and differ from them in value (they lie further away from the regular observations in the feature space). That is why, under such random partitioning, they should be isolated closer to the root of the tree (shorter average path length, i.e., the number of edges an observation passes going from the root to the terminal node), with fewer splits necessary.

Here is some sample code in Python:
from sklearn.ensemble import IsolationForest
# training the model
rng = np.random.RandomState(42)

clf = IsolationForest(n_estimators=100, max_samples='auto',
                      random_state=rng, contamination=0.01,
                      max_features=1.0, bootstrap=False,
                      n_jobs=None, behaviour='old', verbose=0)
# note: the behaviour parameter is deprecated and removed in newer scikit-learn versions, so it can be dropped

df_x1.head()
clf.fit(df_x1)

# predictions
y_pred = clf.predict(df_x1)
y_pred[0:5]

df_x1['isolation']=y_pred
df_x1.head()
df_x1.isolation.value_counts()
df_x1.sort_values('isolation').head(10)

# output: -1 for outliers, 1 for normal
df_xa = df_x1[df_x1.isolation == -1]
df_xa.head()

# generate an anomaly score for each record,
# using either score_samples() or decision_function();
# drop the 'isolation' flag column so the features match what the model was fit on
features = df_x1.drop(columns=['isolation'])
pred1 = clf.score_samples(features)
pred1 = clf.decision_function(features)   # alternative scoring; overwrites pred1
df_x1['isolation'] = pred1
df_x1.head()

The idea of distinguishing a normal vs. an abnormal observation can be seen in the following graph: a normal point (on the left) requires more partitions to be isolated than an abnormal point (on the right).
[Figure: Isolation Forest partitioning of a normal point vs. an abnormal point]


If we need to build a regression model to predict a target variable, then the best approach is definitely the deviation-based model approach, for the same reason as the 25-year-old vehicle example above: the distance-based approach would flag that vehicle as an outlier on every dimension, even though its low $500 value is perfectly reasonable given its age and mileage.

If we have to choose between the Mahalanobis distance and the IsolationForest approach, my previous working experience is that Mahalanobis distance is preferable: it is more explainable than IsolationForest.
Benford's law and its application in outlier detection
Benford's law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading digit is likely to be small.

For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. The exact distribution follows P(d) = log10(1 + 1/d), as computed in the sketch below.
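
A small sketch that prints the expected Benford first-digit frequencies, which can then be compared against the observed first digits of a dataset (the 'amount' column in the commented part is hypothetical):

import numpy as np

digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)          # expected frequency of each leading digit
for d, p in zip(digits, benford):
    print(d, round(p * 100, 1), '%')        # 1 -> 30.1%, 2 -> 17.6%, ..., 9 -> 4.6%

# observed leading-digit frequencies for a positive numeric series (hypothetical column 'amount'):
# first_digit = df['amount'].astype(str).str.lstrip('0.').str[0].astype(int)
# observed = first_digit.value_counts(normalize=True).sort_index()
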
In 1995, the phenomenon was finally proven by Hill. His proof was based on the fact that numbers in data series following Benford's law are, in effect, "second generation" distributions, i.e., combinations of other distributions.

Benford's law tends to apply most accurately to data that span several orders of magnitude. As a rule of thumb, the more orders of magnitude that the data evenly covers, the more accurately Benford's law applies. For instance, one can expect that Benford's law would apply to a list of numbers representing the populations of UK settlements. But if a "settlement" is defined as a village with population between 300 and 999, then Benford's law will not apply.

In the United States, evidence based on Benford's law has been admitted in criminal cases at the federal, state, and local levels.

Benford's law has been invoked as evidence of fraud in the 2009 Iranian elections, and has also been used to analyze other election results. However, other experts consider Benford's law essentially useless as a statistical indicator of election fraud in general.

It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, etc.
