Python Programming Must-know Skills - 1
How do we install packages in Python?
How do we send emails in Python?
How do we load/output a csv file in/to Python?
How do we read/load R datasets in/to Python?
How do we subset a dataframe in Python?
How do we use simple SQL on a Python dataframe?
How do we download data from Netezza?
How do we download data from Amazon Redshift?
How do we upload data from Python onto Amazon Redshift?
How do we include a separate piece of Python code to run?
How do we combine a list of dataframes into one dataframe?
Local vs. Global Variables
How do we get outliers from Python?
Internal studentized residual vs. external studentized residual
How do we scatter plot in Python?





Question: How do we include a separate piece of python code to run?
#===============================================================
# to include a separate piece of python code, change to its folder first
#===============================================================
import os
path="C:/IS/directory/"
os.chdir(path)

#Then from that directory you can import any frequently-used function 
from comm_util import drop_table,  init_connection, put_pickle_to_s3
import comm_util
#===========================================================


Question: How do we send emails in python?
#===========================================================
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

fromaddr = "frist.last@campany.com"
##== you can also set it empty:  fromaddr = "" 
toaddr = "email.address.to.send@domain.com"
msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "Oh, Yeah! Python Task finished!"

body = "Python Task-1 finished"
msg.attach(MIMEText(body, 'plain'))

import smtplib
server = smtplib.SMTP('smtp.yourcompanydomain.int', 25)
server.ehlo()
server.starttls()
server.ehlo()
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()
#===========================================================
Question: How do we install packages in python?
#===========================================================
#===there are several ways to install packages in python
#Method 1: use anaconda to install
#search for "Anaconda Prompt" from the Start menu
pip install -U pandasql
# if it shows permission error, use this one: 
pip install --user pandasql   
# ==========================================================
          
pip install pandas_redshift  #== install the pandas_redshift package
pip install spyder    #=====  install spyder
pip install --upgrade spyder  #==== upgrade spyder

#===========================================================
#Method 2: without anaconda, run the code directly in the spyder IDE
import pip
pip.main(['install','pandas_redshift'])
#==In case of AttributeError: module 'pip' has no attribute 'main',
#==use the internal entry point instead (a pip3/newer-pip issue):
from pip._internal import main
main(['install','pandas_redshift'])
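#==Another option (a sketch, not from the original page): newer pip
#==releases removed pip.main entirely, so a common workaround is to
#==call pip through the interpreter itself:
import subprocess, sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pandas_redshift'])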
#===========================================================
Question: How do we get outliers from python?

There are quite a few ways to do outlier detection, including:

1. IQR method, distance-based approaches (Mahalanobis distance, one-class SVM, the IsolationForest algorithm in python, etc.), and density-based approaches (Local Outlier Factor). These are all based on spatial proximity; in other words, they measure how far an observation is from the rest of the cluster of observations, using either one dimension or multiple dimensions.

2. Model-based, deviation approaches (Cook's distance, studentized residuals). Here you need to build a statistical model first and then detect the outliers from it.

The main difference between the two families: the IQR / density-based / distance-based approaches catch observations that are extreme values on one or multiple dimensions, while the model-based approach (Cook's distance, studentized residuals) focuses on the distance to the fitted curve rather than the distance to the raw data cluster itself.

For instance, a 25-year-old vehicle with 300,000 miles on it gets a very low value, say $500. If we check the individual dimensions, mileage, age, and value are all at the extreme ends, some at the high end (age/mileage) and some at the low end (value), so the distance approach will easily flag that vehicle as an outlier. However, based on our intuitive understanding, an old vehicle like that deserves a low value of $500; in other words, it is most likely a reasonable case, not an outlier.

Now you can easily see the pros and cons of the two approaches. The first, with its relatively simple dimensional-distance view, is more generic for catching outliers and is not tied to a specific model. The second, model-based approach is more powerful when we are trying to remove outliers in order to fit a particular model. In other words, if you are cleaning the data before fitting your model, prefer the second, model-based statistical approach.

Here is an example using the IsolationForest algorithm in python.

#===========================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
clf = IsolationForest(max_samples=100, contamination=0.0015,
                      random_state=rng)
clf.fit(df_train)
outlier = clf.predict(df_train)     # -1 = outlier, 1 = inlier
df_train['outlier'] = outlier
df_train.outlier.describe()
#===========================================================


Here is a sample function to catch outliers (it appends its result to a list named outlier_all, which is assumed to be defined beforehand):
#===========================================================
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

def catch_outlier(Some_ID):
    outlier1 = input_data[input_data.ID==Some_ID]
    outlier1.shape
    ols = smf.ols('dep_var ~ var1_num + var2_num + C(var_category, Treatment)',
                  data=outlier1).fit()
#    print(ols.summary())
#    ols.params
#    ols.rsquared
#    start =  time.time()    
    output1= OLSInfluence(ols)
    student_resid=output1.get_resid_studentized_external()
    outlier1['student_resid'] =student_resid
    cooks_d=output1.cooks_distance
    outlier1['cooks_d'] =cooks_d[0]     
#    end = time.time()
#    print(round((end - start)/60,2), "mins" )   

    # calculate outlier flags based on studentized-residual and Cook's distance cutoffs
    outlier1['studentr_flag'] = outlier1['student_resid'].abs() >= 3
    outlier1['cooksd_flag'] = outlier1['cooks_d'] > 4/outlier1.shape[0]
    outlier1['outlier']=outlier1.studentr_flag & outlier1.cooksd_flag
    outlier2 =outlier1[outlier1['outlier']==1]    
    outlier3 =outlier2[['outlier','transaction_id','ID']]

    outlier_all.append(outlier3)
    return outlier_all
#===========================================================


Notice there is a significant difference in running time if you don't choose these carefully.

#===========================================================
##==using summary_frame will take much longer than  OLSInfluence ===
oi = ols.get_influence()
Student_red_cook=oi.summary_frame()[["student_resid","cooks_d"]]
#==summary_frame also generates several other unused statistics
#==Cook's distance and DFFITS are closely related measures
#===========================================================
##==prefer using OLSInfluence, it's much quicker
output1 = OLSInfluence(ols)
student_resid = output1.get_resid_studentized_external()
outlier1['student_resid'] = student_resid
cooks_d = output1.cooks_distance
outlier1['cooks_d'] = cooks_d[0]
#===========================================================
#==Another comparison: using the get_ method is much quicker
student_resid=output1.get_resid_studentized_external()
#student_resid=output1.resid_studentized_external ##==take much longer 
#===========================================================


Question: What's the difference between the internal studentized residual and the external studentized residual?

1. A studentized residual is the standardization of (y - y_hat); the denominator is sigma_hat * sqrt(1 - h_ii), where h_ii is the leverage of observation i.

2. For the internal studentized residual, sigma_hat^2 is estimated as the sum of (y - y_hat)^2 over all observations, divided by (n - m), where n is the number of observations and m the number of fitted parameters.

3. For the external studentized residual, sigma_hat^2 is estimated as the sum of (y - y_hat)^2 over all observations except the i-th observation itself, divided by (n - m - 1).

4. If the i-th case is suspected of being improbably large, then it is also not normally distributed. Hence it is prudent to exclude the i-th observation from the variance estimate when considering whether the i-th case may be an outlier, which is exactly what the external studentized residual does.

Here is a very useful tutorial about how to get these outlier statistics; in case the url is outdated, here is a pdf copy of it.
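To see the two statistics side by side in code, here is a minimal sketch (it assumes a fitted statsmodels OLS result named ols, like the one built in the outlier function above):

#===========================================================
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(ols)                      # ols is an already-fitted OLS result
internal = infl.resid_studentized_internal    # sigma estimated from all n observations
external = infl.resid_studentized_external    # sigma re-estimated leaving out observation i
# the two are nearly identical for well-behaved points and diverge for suspect ones
print(abs(internal - external).max())
#===========================================================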

Question: How do we get the corresponding week-end date for a specified date in python?
#=====================================
import datetime
weekno = datetime.date.today().isoweekday()
weekno
weekend = datetime.date.today() - datetime.timedelta(weekno)

##====it's trickier to get this value for a date column in a dataframe
data0['date_wkend'] = (pd.to_datetime(data0.sale_date)
                       - pd.to_timedelta(
                           pd.to_datetime(data0.sale_date)
                           .dt.weekday + 1, unit='D'))
##==note: you need the .dt accessor to get weekday from a Series.
#===========================================================
Question: How do we load/output csv/excel data in/to python?
#===========================================================
 #read a csv file using pandas.read_csv
#===========================================================
# Make sure pandas is loaded
import pandas as pd
file=r'C:\Directory\Name_of_the_file_import.csv'
test = pd.read_csv(file)
test.shape
test.head()

#====================================
 #save a csv file using DataFrame.to_csv
#====================================
file=r'C:\Directory\Name_of_the_file_output.csv'
pd.DataFrame(dataname_to_output).to_csv(path_or_buf=file)

                    
#====================================
 #save results to excel sheets
#====================================
file=r'C:\directory\excel_file_name_you_want.xlsx'
writer = pd.ExcelWriter(file, engine='xlsxwriter')

result[0].to_excel(writer, sheet_name='Sheet1_name')
workbook = writer.book

worksheet = writer.sheets['Sheet1_name']
worksheet.set_zoom(90)
percent_fmt = workbook.add_format({'num_format': '0.000%', 'bold': False})
worksheet.set_column('A:G', 20)
worksheet.set_column('C:G', 20, percent_fmt)
writer.close()   #== writes the file to disk (writer.save() on older pandas)
#===========================================================
Question: How do we subset dataframe in python?
#================================================
# Best method: select via column numbers ========
# very convenient to use iloc to subset data: ====
# Be careful: Python uses 0-based indexing
test1 = test.iloc[:, [0, 1, -1, 6]]
test1.tail()

# select rows 0, 1, 2 (row 3 is not selected)
test[0:3]

# select rows of 0 and 10 (not from 0 to 10)
test.loc[[0, 10], :]
 
# Method 1: select a 'subset' of the data using the column name
test[['var1', 'var2']]

# Method 2: use the column name as an 'attribute'
test.species_id

# select the first 5 rows (rows 0, 1, 2, 3, 4)
test[:5]

# select the last element in the list
# (the slice starts at the last element,
# and ends at the end of the list)
test[-1:]   

#create a dummy variable via np.where
import numpy as np
test1['new_var'] = np.where(test['var_check'] >= 5, 1, 0)

#subset a dataframe via multiple filters
test2=test[(test['var1_check']==1) 
         & (test['saledate']<=holdout + timedelta(days=7*i))]

test1=test1.rename(columns={"old_var_name":"new_var_name"})
#================================================
 
Some of the tutorials are referenced from here.

Question: How do we read/load R datasets/objects in/to python?

#===========================================================
#===You need to install and use rpy2 package===============
#===rpy2 is an interface to R running embedded in Python.===
#===use the following code to install via the anaconda prompt===
conda install -c r rpy2 
#===pip install rpy2 might not work in some environments===
import rpy2
print(rpy2.__version__)
#===========================================================
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
#===========================================================
readRDS = robjects.r['readRDS']
df = readRDS('my_file.rds')
df = pandas2ri.ri2py(df)
#===========================================================
Question: How do we scatter plot in python?
#===========================================================
#Method 1: use matplotlib to scatter plot by group 
import matplotlib.pyplot as plt
groups = data_input.groupby('plot_by_this_var_group')
fig, ax = plt.subplots()   #==fig is the parent figure
#fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(nrows=2, ncols=2)
#== alternative: a 2*2 grid of plots
#fig, [ax1, ax2, ax3, ax4] = plt.subplots(nrows=1, ncols=4)
#== alternative: a 1 row * 4 column grid of plots

ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.x_axis_name, group.y_axis_name, marker='.', 
    linestyle='', ms=12, label=name)
ax.legend()
plt.show()
 
#===========================================================    
#Method 2: use the seaborn package, a graphics library built on top of matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
sns.lmplot(x="x_axis_name", y="y_axis_name",
           palette=["red", "blue", "black", "orange", "purple"],
           data=data_input, fit_reg=False,
           hue='plot_by_this_var_group', legend=False)
plt.legend(loc='lower right')
plt.show()
#===========================================================


Question: Say we have a list of dataframes, i.e. each element in the list is a dataframe with the same variable names. How do we combine them into one dataframe?
#===========================================================
import pandas as pd
New_dataframe = pd.concat(Old_list, ignore_index=True)
#== pd.concat replaces the old loop of DataFrame.append calls,
#== which was removed in pandas 2.0
New_dataframe.shape
New_dataframe.head()
#===========================================================


Question: How do we download data from amazon redshift?
We use the sqlalchemy package to download data from Redshift. SQLAlchemy is a SQL toolkit released under the MIT license.
#===============================================================
import pandas as pd
import sqlalchemy as sa
#== the redshift+psycopg2 dialect comes from the sqlalchemy-redshift package

engine_setup = sa.create_engine(
    'redshift+psycopg2://{user}:{password}@{host}:{port}/{dbname}'.format(
        dbname=DATABASE,
        user=USER,
        password=PASSWORD,
        port=PORT,
        host=HOST),
    encoding='utf-8',
    connect_args={"keepalives": 1, "keepalives_idle": 60, "keepalives_interval": 60}
    )
                       
sql_query = """select var1,var2 from data0 where var_filter = '****' """
data_in=pd.read_sql_query(sql_query, con=engine_setup)
data_in.shape
#===============================================================



Question: How do we download data from IBM netezza data warehouse?
#===============================================================
import pandas as pd
import pyodbc 
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
                      "PORT=5480;DATABASE=***;UID=*****;PWD=*****;DSN=NZSQL")
print("Connection complete")
cursor=conn.cursor() 

cursor.execute("SELECT * FROM data_input limit 1000")
colnames=[desc[0] for desc in cursor.description]
df=pd.DataFrame.from_records(cursor.fetchall(),columns=colnames)
#df=pd.DataFrame.from_records(cursor.fetchmany(size=2),columns=colnames)  #== alternative: fetch only a few rows instead of fetchall
df.shape
df.describe()
df.columns

cursor.close()
conn.close()
#===============================================================
Question: How do we upload data from python into IBM Netezza data warehouse?

#===============================================================
import pandas as pd
import pyodbc 
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
                      "PORT=5480;DATABASE=***;UID=*****;PWD=*****;DSN=NZSQL")
cursor = conn.cursor()
# create table
cursor.execute("CREATE TABLE TEST("
    "C_ID  INT,"
    "C_FP  FLOAT,"
    "C_VARCHAR VARCHAR(100),"
    "C_DATE DATE, C_TIME TIME,"
    "C_TS TIMESTAMP,"
    "C_BOOL BOOL)")
cursor.execute("INSERT INTO test VALUES(1,1.1,'asdgaing',
      '1901-01-01','23:12:34','1901-01-01 09:00:09','t')")
cursor.execute("INSERT INTO test VALUES(2,3.4,'asdfg2345',
      '1991-11-11','00:00:01','1981-12-31 19:19:19','f')")
cursor.execute("SELECT * FROM TEST")
rows = cursor.fetchall()
for row in rows:
    print(row, end='\n')
cursor.execute("DROP TABLE TEST CASCADE")
cursor.close()
conn.close()
#===============================================================

Question: How do we upload data from python onto amazon redshift?

#===============================================================
import pandas_redshift as pr
pr.connect_to_s3(aws_access_key_id = '***********',
              bucket ='******-nzref',
              aws_secret_access_key = '2*********A'
              )
pr.connect_to_redshift(dbname = "your_database_name",
               user = "your_user_name",
               password = "your_pw",
               port = "5439",
               host = "****.redshift.amazonaws.com")

data_type = Your_data_in_python.dtypes.replace(
    ['object',       'float64', 'int64',  'datetime64[ns]'],
    ['VARCHAR(255)', 'FLOAT',   'BIGINT', 'TIMESTAMP'])
pr.pandas_to_redshift(data_frame = Your_data_in_python,
redshift_table_name = 'New_dataname_in_redshift',column_data_types =data_type) 
#===============================================================


Question: How do we use simple SQL on a python dataframe?


Answer: In python you can use many other functions to do grouped summaries, and they are very flexible; however, the syntax is not that easy to remember, so you may find yourself looking up the example syntax every time you need it. Most people are already familiar with SQL, either from prior analytic/IT training or from previous SAS/R SQL experience, so it is convenient to keep using similar SQL queries in python. We can use the pandasql package for this purpose.


#===============================================================
pip install pandasql ##== run this in the anaconda prompt to install==

from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())

data1=pysql("""select var0,count(*) as cnt,sum(var1)/sum(var2) 
          as pcnt from data0 group by 1""")
          
data2=pysql("""select a.*,b.pcnt from data0 as a 
          inner join data1 as b on a.var0=b.var0""")
          
##==Use 3 double quotes for multi-line queries, or a single pair of quotes for one line.
#===============================================================
If you are interested in using the pivot_table or groupby functions to do summaries, here are some examples.


#===============================================================
test3 = pd.DataFrame.merge(test1, test2, left_on=['var1','var2'],
        right_on=['var1','var2'], how='left')
         
list1 = ['metric_var1','metric_var2']
summary1 = (ds[(ds['var1'] >= 0) & (ds['var2'] == 0)]
            .pivot_table(index='var0', values=list1, aggfunc='sum'))

summary2 = (ds[(ds['var1'] >= date0) & (ds['var2'] == 0)]
            .groupby('var0', as_index=False).agg({"var4": 'count'}))
#===============================================================




Question: What are local vs. global variables/dataframes/environments in Python?
#1: Example of a simple event-driven program (CodeSkulptor course notes)
# Import CodeSkulptor GUI module
import simplegui

# Event handler
def tick():
    print "tick!"

# Register handler
timer = simplegui.create_timer(1000, tick)

# Start timer
timer.start()
#2: global vs local examples
# num1 is a global variable
num1 = 1
print(num1)

# num2 is a local variable
def fun():
    num1 = 2
    num2 = num1 + 1
    print(num2)
fun()

# the scope of global num1 is the whole program, num1 remains defined
print(num1)

# the scope of the variable num2 is fun(), num2 is now undefined
print(num2)

# why use local variables?
# give a descriptive name to a quantity
# avoid computing something multiple times

def fahren_to_kelvin(fahren):
    celsius = 5.0 / 9 * (fahren - 32)
    zero_celsius_in_kelvin = 273.15
    return celsius + zero_celsius_in_kelvin
print(fahren_to_kelvin(212))

# the risk/reward of using global variables
# risk - consider the software system for an airliner
# critical piece - flight control system
# non-critical piece - in-flight entertainment system
# both systems might use a variable called "dial"
# we don't want possibility that change the volume on your audio
# causes the plane's flaps to change! 

# example
num = 4

def fun1():
    global num
    num = 5
    
def fun2():
    global num
    num = 6

# note that num changes after each call with no obvious explanation    
print(num)
fun1()
print(num)
fun2()
print(num)

# global variables are an easy way for event handlers
# to communicate game information.
# a safer method exists, but it requires more object-oriented programming
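
# A sketch of that safer, more object-oriented alternative (not from the
# original notes): keep the state on an object instead of in a global
# variable, and make the event handler a method of that object.
class Game:
    def __init__(self):
        self.message = "Welcome!"

    def click(self):                 # event handler updates the object's own state
        self.message = "Good job!"

game = Game()
game.click()
print(game.message)                  # prints "Good job!" -- no global needed
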
# Local vs. Global Variables Example
# For this example, there are five versions of the same program. Three of
# them work and two of them don't. The goal of the program is to move an
# object along a number line using a move method, keep track of its location
# with the variable loc, and calculate the displacement (always positive)
# from its starting location with the displacement method.

# Version one: Doesn't work. The move method throws an error,
#	while the displacement method always prints 0 because
#	the location never changes.
start = 1
loc = start

def move(x):
    loc = loc + x

def displacement():
    return abs(loc - start)

#move(3)
print "Version 1:", displacement()
#move(-2)
print "Version 1:", displacement()
print

# Version two: Works. Fixes version one by declaring variables
#	global at the start of the move method. No global
#	declaration is needed at the start of the displacement
#	method because the values in the global variables loc 
#	and start are not changing.

start = 1
loc = start

def move(x):
    global loc
    loc = loc + x

def displacement():
    return abs(loc - start)

move(3)
print "Version 2:", displacement()
move(-2)
print "Version 2:", displacement()
print

# Version three: Also works. This one returns values instead
#	of attempting to override the global variable. Notice 
#	that the new local variable must have a different name.
#	Notice also that we must assign loc to the value returned
#	by the move method.

start = 1
loc = start

def move(x):
    pos = loc + x
    return pos

def displacement():
    return abs(loc - start)

loc = move(3)
print "Version 3:", displacement()
loc = move(-2)
print "Version 3:", displacement()
print

# Version Four: This one does not work. The loc that is 
#	a parameter in the move method is actually a local 
#	variable instead of a global one, and therefore the 
#	value of the global loc does not change.

start = 1
loc = start

def move(x, loc):
    loc += x
    return loc

def displacement():
    return abs(loc - start)

move(3, loc)
print "Version 4:", displacement()
move(-2, loc)
print "Version 4:", displacement()
print


# Version Five: This one fixes the problem from version 
#	four. This one passes the method the value of loc as a 
#	parameter, and returns the value of the new loc. Note 
#	that in this example the local variable shares the same
#	name as the global one, but is not actually the same thing.

start = 1
loc = start

def move(x, loc):
    loc += x
    return loc

def displacement():
    return abs(loc - start)

loc = move(3, loc)
print "Version 5:", displacement()
loc = move(-2, loc)
print "Version 5:", displacement()
print  



# CodeSkulptor runs Python programs in your browser.
# Click the upper left button to run this simple demo.
# CodeSkulptor runs in Chrome 18+, Firefox 11+, and Safari 6+.
# Some features may work in other browsers, but do not expect
# full functionality.  It does NOT run in Internet Explorer.

import simplegui

message = "Welcome!"

# Handler for mouse click
def click():
    global message
    message = "Good job!"

# Handler to draw on canvas
def draw(canvas):
    canvas.draw_text(message, [50,112], 36, "Red")

# Create a frame and assign callbacks to event handlers
frame = simplegui.create_frame("Home", 300, 200)
frame.add_button("Click me", click)
frame.set_draw_handler(draw)

# Start the frame animation
frame.start() 


