Saturday, December 7, 2019

Python study notes 1 - Installation basics, local vs global

How do we install packages in python?
How do we send emails in python?
How do we load/output a csv file in/to Python
How do we output/import a pickle file in Python
How do we read/load R datasets in/to Python
How do we create empty dataframe and then fill values
How do we subset a dataframe in Python
How do we use simple SQL on a Python dataframe
How do we download data from netezza?
How do we download data from amazon redshift?
How do we upload data from python onto amazon redshift?
How do we include a separate piece of python code to run?
How do we convert a list to dataframe ?
Local vs. Global Variables
How do we get outliers from python? 

How do we rank column by certain group?
Internal studentized residual VS external studentized residual
How do we scatter plot in python?
How do we drop big dataframe and release memory in python?
How do we break a list column to multiple bins, decile or quantile?


Question: How do we install python the easiest?
If you have any concerns or issues with Python, you can re-install it via Anaconda using the following steps:

1. Click the "Uninstall-Anaconda3" icon under C:\Users\*****\AppData\Local\Continuum\anaconda3 to uninstall the previous Anaconda. This might take 20-30 mins to finish; manually delete the anaconda3 folder after the uninstallation.

2. Go to https://www.anaconda.com/distribution/ to download the corresponding version, installation might take just a few mins.

3. Open the Anaconda Prompt and run conda list anaconda to see which version of Anaconda is installed.
Update to the newest with conda update --all, then conda update anaconda. If you want to run Spyder, go to Anaconda Navigator to launch Spyder first, then pin it to the taskbar.

4. Install Python packages via Anaconda; for example, run the following from the Anaconda Prompt window:
#===============================================================
conda install -c anaconda boto3
y
conda install -c conda-forge s3fs  
y
conda install -c conda-forge psycopg2                                    
y
conda install -c conda-forge pandasql
y
conda install -c https://conda.anaconda.org/anaconda pandas-datareader
y
#don't run this: conda install -c conda-forge keras
#it can leave the anaconda prompt not working; use pip to install keras
pip install tensorflow
y
pip install keras
y
conda install -c conda-forge sqlalchemy 
y
conda install -c conda-forge sqlalchemy-redshift 
y
#this one will not work: conda install -c conda-forge pandas-datareader                              
#it will give you the error: PackagesNotFoundError: the following packages are not available.

#install NLP related packages, topic modeling etc. 
conda install -c anaconda gensim
y
conda install -c anaconda spacy
y
conda install -c anaconda pyLDAvis
y
conda install -c anaconda nltk
y
# Run in python console
import nltk 
#in case you load en_core_web_sm from spacy, you might get an error: 
nlp = spacy.load('en_core_web_sm')
##error message: Can't find model 'en_core_web_sm'; you can run the following from the terminal: 
python -m spacy download en
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
python -m spacy link en_core_web_sm en

#If it is still not working:
#Then you can use the link directly from github: 
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz --no-deps

pip install stockstats
pip install https://github.com/matplotlib/mpl_finance/archive/master.zip
#or use this: pip install mpl_finance

#This one might work: conda install -c primer stockstats
#it does not work with: conda install -c https://conda.anaconda.org/anaconda stockstats
#nor with: conda install -c conda-forge stockstats
#in case of the error: UnsatisfiableError (conda version conflict), 
#run this at the anaconda prompt: 
pip install stockstats 
#===============================================================
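After the installs finish, a quick sanity check (my own snippet, not part of the original steps) is to import a couple of the packages from a Python console and print their versions:
#===============================================================
import boto3, sqlalchemy, nltk
print(boto3.__version__, sqlalchemy.__version__, nltk.__version__)
#===============================================================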

Question: How do we standardize data in python?
#===============================================================
atm3a=atm3[['mileage','condition_score','my_age']]
from sklearn import preprocessing
standard_sc = preprocessing.StandardScaler()
atm3b = pd.DataFrame(standard_sc.fit_transform(atm3a))
atm3b.columns=['mileage1','condition1','age1']
atm3b.tail()
atm3b.age1.describe()
atm3b.mileage1.describe()
atm3b.index

#min_max_scaler = preprocessing.MinMaxScaler()
#X_train_minmax = min_max_scaler.fit_transform(X_train)
#===============================================================

Question: How do we include a separate piece of python code to run?
#===============================================================
#re-use some common functions  
import importlib
comm_util = importlib.import_module('.s3_comm_util', package='dsv-aws-common')
##package here is the folder name
get_pickle_from_s3 = comm_util.get_pickle_from_s3
get_s3_csv_gz = comm_util.get_s3_csv_gz

# include a separate python code to run, need to change folder first 
#===============================================================
##===to see the current working directory===##
import os
cwd = os.getcwd()
print(cwd)

# use '\\' while changing the directory 
os.chdir("C:\\user\\foldername")

#Then from that directory you can import any frequently-used function 
from comm_util import drop_table,  init_connection, put_pickle_to_s3
import comm_util
#===========================================================

Question: How do we send emails in python?
#===========================================================
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

fromaddr = "frist.last@campany.com"
##== you can also set it empty:  fromaddr = "" 
toaddr = "email.address.to.send@domain.com"
msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "Oh, Yeah! Python task finished! "

body = "Python Task-1 finished"
msg.attach(MIMEText(body, 'plain'))

import smtplib
server = smtplib.SMTP('smtp.yourcompanydomain.int', 25)
#server.ehlo()
#server.starttls()
#server.ehlo()
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
#===========================================================

Question: How do we install packages in python?
In Python, packages such as Flask are available in a public repository, from which anybody can download and install them. The official Python package repository is called PyPI, which stands for Python Package Index.
#===========================================================
#use the following to get any help: 
!pip install --help  #running in notebook or spyder
#===there are several ways to install packages in python
#Method 1: using anaconda to install
#search Anaconda Prompt from "Start"
pip install -U pandasql
# if it shows permission error, use this one: 
pip install pandasql --user    
# ==========================================================
#why we use --user: by default, pip installs Python packages 
#to a system directory (such as /usr/local/lib/python3.4), which requires root access.
#--user makes pip install packages in your home directory instead, 
#which doesn't require any special privileges.
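#To see exactly where the --user installs land (my own check, not part of
#the original notes), print the per-user site-packages directory:
python -m site --user-site
#or, from inside Python:
import site
print(site.getusersitepackages())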

pip install pandas_redshift #== pip: the Python package installer
pip install spyder    #=====  install spyder
pip install --upgrade spyder  #==== upgrade spyder

#===========================================================
#Method 2: if no anaconda, then run the code directly in IDE spyder
# or run the following in jupyter notebook 
import pip
from pip._internal import main #===it's due to the pip3 issue==
pip._internal.main(['install','pandas_redshift'])
#==with newer pip versions (10+) you will get: AttributeError:##
#==module 'pip' has no attribute 'main' if you use: 
pip.main(['install','pandas_redshift'])
#===========================================================
Here is a handy trick to format all float numbers to a fixed number of decimals (4 below) instead of scientific notation.

import pandas as pd
pd.options.display.float_format = '{:.4f}'.format


Question: How do we get outliers in Python? How do we rank a column within a group?
There are quite a few ways to do outlier detection, including:

1. IQR method, distance-based approaches (Mahalanobis distance, one-class SVM, the IsolationForest algorithm in Python, etc.), and density-based approaches (Local Outlier Factor). These are based on spatial proximity; in other words, they measure how far an observation is from the cluster of other observations, in one or multiple dimensions (see the IQR sketch after this list).

2. Deviation-based model approaches (Cook's distance, studentized residuals), where you first need to build a statistical model and then detect the outliers from it.
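As a quick illustration of the first family, here is a minimal IQR-rule sketch (my own example, assuming a generic numeric pandas Series): flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
#===============================================================
import pandas as pd

def iqr_outliers(s, k=1.5):
    #return a boolean mask marking values outside the IQR fences
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

s = pd.Series([1, 2, 2, 3, 3, 4, 50])
print(s[iqr_outliers(s)])   #only 50 gets flagged
#===============================================================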

#generate the percentile for all the values in the column:
df['pct_rank'] = df['Number_legs'].rank(pct=True)
#the output will range from 0 to 1 (i.e. the percentile).
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
#treat the NaN values as the bottom of the rank (the largest rank).

df["rank"] = df.groupby("group_ID")["value"].rank("dense", ascending=False)
>>> df
     group_ID   item_ID  value  rank
0  0S00A1HZEy      AB     10     2
1  0S00A1HZEy      AY      4     3
2  0S00A1HZEy      AC     35     1 

#we can also use qcut function in python to rank columns, for example
data1['ranks'] = data1.groupby('group_var')['column_var'].transform(pd.qcut,3, labels=False,duplicates='drop')
#3 means 3 bins, i.e. ranks of 0, 1, 2.

#remove any value above a certain percentile
#numpy.percentile(df.a,95) is much faster than df.a.quantile(.95)
t2b= t2b[(t2b["ratio_list"] < np.percentile(t2b["ratio_list"],97.5))] 
The main difference between the two approaches: the IQR, density-based, and distance-based approaches catch outliers that are essentially extreme values in one or multiple dimensions. Meanwhile, the model approach (Cook's distance, studentized residuals) focuses on the distance to the fitted curve instead of the distance to the raw data cluster itself.

For instance, a 25-year-old vehicle with 300,000 miles on it gets a very low value, say $500. If we check all the dimensions (mileage, age, value), every one of them is at an extreme end, some at the high end (age/mileage), some at the low end (value), so that vehicle will easily be caught as an outlier by the distance approach. However, based on our intuitive understanding, an old vehicle like that deserves a low value of $500; in other words, it is most likely a reasonable case, not an outlier.

Now you can easily see the pros and cons of the two approaches. The first, the relatively simple dimensional-distance approach, is more generic for catching outliers and is not tied to a specific model. The second, the model-based approach, is more powerful when we are trying to remove outliers in order to fit a particular model. In other words, if you are trying to clean the data before fitting your model, the model-based statistical approach is preferred. Here is an example using the IsolationForest algorithm in Python.

Question: How do we save/export/load a Python data frame to an R data type?
#===========================================================
###Export in Python===
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
###Import in R=========
library(feather)
path <- "my_data.feather"
df <- read_feather(path)
save(df, file="my_data.RData")
#===========================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
clf=IsolationForest(max_samples=100,contamination=0.0015, 
random_state=rng)
clf.fit(df_train)
outlier = clf.predict(df_train)
df_train1['outlier'] =outlier 
df_train1.outlier.describe()
#===========================================================
Here is sample code defining a function to catch outliers:
#===========================================================
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

def catch_outlier(Some_ID):
    outlier1 = input_data[input_data.ID==Some_ID]
    outlier1.shape
    ols = smf.ols('dep_var ~ var1_num + var2_num + '
                  'C(var_category, Treatment)', data=outlier1).fit()
#    print(ols.summary())
#    ols.params
#    ols.rsquared
#    start =  time.time()    
    output1= OLSInfluence(ols)
    student_resid=output1.get_resid_studentized_external()
    outlier1['student_resid'] =student_resid
    cooks_d=output1.cooks_distance
    outlier1['cooks_d'] =cooks_d[0]     
#    end = time.time()
#    print(round((end - start)/60,2), "mins" )   

    # calculate outlier flags based on the studentized-residual and Cook's D cutoffs    
    outlier1['studentr_flag'] = outlier1['student_resid'] >= 3
    outlier1['cooksd_flag'] = outlier1['cooks_d'] > 4/outlier1.shape[0]
    outlier1['outlier']=outlier1.studentr_flag & outlier1.cooksd_flag
    outlier2 =outlier1[outlier1['outlier']==1]    
    outlier3 =outlier2[['outlier','transaction_id','ID']]

    outlier_all.append(outlier3)
    return outlier_all
#===========================================================
Notice there are significant differences in running time if you don't choose these carefully.
#===========================================================
##==using summary_frame will take much longer than  OLSInfluence ===
oi = ols.get_influence()
Student_red_cook=oi.summary_frame()[["student_resid","cooks_d"]]
#==summary_frame generates several additional stats we don't use  
#==Cook distance and DFFITS are essentially the same 
#===========================================================
##==prefer using OLSInfluence,it's much quicker
 output1= OLSInfluence(ols)
 student_resid=output1.get_resid_studentized_external()
 outlier1['student_resid'] =student_resid
 cooks_d=output1.cooks_distance
 outlier1['cooks_d'] =cooks_d[0]     
#===========================================================
#==Another comparison: using the get_ methods is much quicker    
student_resid=output1.get_resid_studentized_external()
#student_resid=output1.resid_studentized_external ##==takes much longer 
#===========================================================
Question: What is the difference between the internal studentized residual and the external studentized residual?
1. A studentized residual is the standardization of (y - y_hat); the denominator is sigma*sqrt(1 - h_ii), where h_ii is the ith diagonal element of the hat matrix (the orthogonal projection onto the column space). 

2. For the internal studentized residual, sigma^2 is estimated from the sum of (y - y_hat)^2 over all observations, divided by (n - m). 

3. For the external studentized residual, sigma^2 is estimated from the sum of (y - y_hat)^2 over all observations except the ith observation itself, divided by (n - m - 1). 

4. If the ith case is suspected of having an improbably large residual, then it is also not expected to be normally distributed. Hence it is prudent to exclude the ith observation from the variance estimate when considering whether the ith case may be an outlier, which is exactly what the external studentized residual does. Here is a very useful tutorial about how to get these outlier statistics.
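Both flavors are available directly from statsmodels; here is a small sketch (my own, with made-up column names) showing them side by side from the same fitted OLS model:
#===========================================================
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.RandomState(0)
df = pd.DataFrame({'x': rng.normal(size=50)})
df['y'] = 2 * df['x'] + rng.normal(size=50)
df.loc[0, 'y'] = 15   # plant one obvious outlier

ols = smf.ols('y ~ x', data=df).fit()
infl = OLSInfluence(ols)
resid_int = infl.resid_studentized_internal   # sigma estimated from all n points
resid_ext = infl.resid_studentized_external   # point i left out of the sigma estimate
print(pd.DataFrame({'internal': resid_int, 'external': resid_ext}).head())
#===========================================================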

Question: How do we get the data size of all folders under a directory?
#===============================================================
import os
import numpy as np
import pandas as pd
from os.path import join, getsize

global t0,all0
d = {'size_of_MB':[0], 
     'directory_file': ['*******************************************************']}
t00 = pd.DataFrame(data=d)
t00
t0=t00[0:1]
all0=t00[0:0]

def get_size_folder():
    global t0,all0
    i=0
    for root, dirs, files in os.walk('C:\\Users\\***\\Google_Drive'):
        i=i+1 
        size=round(sum(getsize(join(root, name))/(1024*1024) for name in files),0)  
        t0['size_of_MB']=size
        t0['directory_file']=str(root)
#        print(t0)
        all0=all0.append(t0)
#        print(i,' : ',root, ": ",size,' MB ')
    return all0

a0=get_size_folder()
a0.shape
a0.head()

#use split to break the string into multiple substrings
new = a0["directory_file"].str.split("\\", n = -1, expand = True)
new[4].head() 
a1=a0
a1["d1"]= new[4] 
a1["d2"]= new[5] 
a1["d3"]= new[6] 
a1["d4"]= new[7] 
a1["d5"]= new[8] 
a1["d6"]= new[9] 
a1["d7"]= new[10] 
a1.head()

#check the size at the least granular folder level
a11=pysql("""select d1,sum(size_of_MB) as size_of_MB 
from a1 group by 1 order by 2 desc """)
a11.head(10)

a12=pysql("""select d1,d2,sum(size_of_MB) as size_of_MB 
from a1 group by 1,2 order by 3 desc """)
a12.head()

a13=pysql("""select d1,d2,d3,sum(size_of_MB) as size_of_MB 
from a1 group by 1,2,3 order by 3 desc """)
a14.head()
#===============================================================
Question: How do we handle date, year, month in Python? How do we get the corresponding weekend for a specified date in Python?
#=====================================
##if date1 is already in datetime format
df1['year'] =df1['date1'].dt.year 
df1['month'] =df1['date1'].dt.month

import datetime
weekno = datetime.date.today().isoweekday()
weekno
weekend= datetime.date.today()-datetime.timedelta(weekno)

##====it's more tricky to get those value for a variable in dataframe
data0['date_wkend']=(pd.to_datetime(data0.sale_date)
                     - pd.to_timedelta(
                         pd.to_datetime(data0.sale_date)
                          .dt.weekday+1, unit='D'))
##==you need the .dt accessor to avoid the Series error message.

##some other tricky tips to deal with datetime
from datetime import datetime    
row = "2011/1/1 0:30"
date1 = datetime.strptime(row, '%Y/%m/%d %H:%M')
print(str(date1))
#2011-01-01 00:30:00                                                                       
#===========================================================
 
Question: How do we get something similar to the SAS RETAIN statement in Python?
Say there are a few missing values in the data; can we impute them from the previous value for the same ID? Here is an example:
#===========================================================
d = {'id':['AA','AA','AA','AA','BB','BB','BB','CC','CC'], 
     'value': [3, 6, 'NA',3,5,'NA',7,3,'NA']}
df = pd.DataFrame(data=d)
df.head(10)
df.index
##In reality, this "row" variable could be some other level##
df['row']=df.index

from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())

m3a=pysql("""select * from df where value='NA' """)
m3b=pysql("""select * from df where value<>'NA' """)

m3c=pysql("""select a.id,a.row as row0,b.row as row1,
a.row-b.row as diff from m3a as a left join m3b as b
 on a.id=b.id and a.row>b.row """)

m3d=pysql("""select a.id,row0,min(diff) as diff 
from m3c as a group by 1,2 order by 1,2""") 

m3e=pysql("""select a.* from m3c a inner join m3d as b 
on a.id=b.id and a.row0=b.row0 and a.diff=b.diff""")  
 
m3f=pysql("""select a.id,b.row0,a.value from m3b as a
inner join m3e as b on a.id=b.id and a.row=b.row1""") 

m3g=pysql("""select a.id,a.row,a.value as value_original,
coalesce(b.value,a.value) as value from df as a left join
m3f as b on a.id=b.id and a.row=b.row0 order by a.id,a.row""")

#to combine multiple columns in a pandas dataframe, you can use combine_first or fillna('') to coalesce: 
df['c'] = df.a.combine_first(df.b)  # use a first;  elseif a=NaN, then use b.
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
df.var_comb=df['var1'].fillna('')+df['var2'].fillna('')+df['var3'].fillna('')
#will have multiple values mixed

cols = ['keywords_0', 'keywords_1', 'keywords_2', 'keywords_3']
df["keywords_all"] = df[cols].apply(lambda x: ','.join(x.dropna()), axis=1) 
#join multiple keywords together with ',' separated. 

#join two string columns together
a1['var1']=(round(a1.var1,0)).astype(str)+'_pcnt_'+round(a1.pcnt,2).astype(str)

#===========================================================
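A simpler pandas-native alternative for the same fill-from-the-previous-row-within-ID logic (a sketch of my own, assuming the missing values are real NaN rather than the string 'NA'):
#===========================================================
import numpy as np
import pandas as pd

d = {'id': ['AA','AA','AA','AA','BB','BB','BB','CC','CC'],
     'value': [3, 6, np.nan, 3, 5, np.nan, 7, 3, np.nan]}
df = pd.DataFrame(data=d)

#forward-fill within each id, like a SAS RETAIN inside a BY group
df['value_filled'] = df.groupby('id')['value'].ffill()
print(df)
#===========================================================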
 
Question: How do we load csv/excel data into Python and output it?
Before we get to loading/outputting files, pay attention to code like:
file=r'C:\Directory\Name_of_the_file_import.csv', what's the r for?
The r prefix makes the string a raw string literal, so the backslashes in the Windows path are kept as-is instead of being treated as escape sequences (such as \n or \t). This is different from the 'r' mode character passed to open(), which controls how a file is opened:
Character Function
r Open file for reading only. Starts reading from the beginning of the file. This is the default mode.
rb Open a file for reading only in binary format. Starts reading from the beginning of the file.
r+ Open file for reading and writing. File pointer placed at the beginning of the file.
w Open file for writing only. File pointer placed at the beginning of the file. Overwrites an existing file and creates a new one if it does not exist.
wb Same as w but opens in binary mode.
w+ Same as w but also allows reading from the file.
wb+ Same as wb but also allows reading from the file.
a Open a file for appending. Starts writing at the end of the file. Creates a new file if it does not exist.
ab Same as a but in binary format. Creates a new file if it does not exist.
a+ Same as a but also open for reading.
ab+ Same as ab but also open for reading.
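Here is a quick illustration (my own) of why the r prefix matters for Windows paths:
#===========================================================
path_raw = r'C:\new_folder\test.csv'   #backslashes kept literally
path_bad = 'C:\new_folder\test.csv'    #\n and \t are treated as escape sequences here
print(path_raw)   #C:\new_folder\test.csv
print(path_bad)   #the \n becomes a newline and the \t a tab
#===========================================================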
#===========================================================
 #read a csv file using pandas.read_csv
#===========================================================
# Make sure pandas is loaded
import pandas as pd
file=r'C:\Directory\Name_of_the_file_import.csv'
test = pd.read_csv(file)
pd.read_csv('data.csv', sep='|', index_col=False, 
             names=['protocol', 'server', 'type', 'value'])
#sometimes you might see an error message like: "UnicodeDecodeError: 'utf-8' 
#codec can't decode byte ** in position **: invalid start byte" 
#then we need to specify the encoding: 
test = pd.read_csv(file,encoding = "ISO-8859-1")
test = pd.read_csv(file,encoding="utf-8")
test.shape
test.head()

#one more tricky thing about read_csv(): suppose a field has the value: "asddg asdfg
#notice there is only one double quote; the 2nd double quote is missing. 
#in this case, you have to use the option quoting=3 (i.e. quoting=csv.QUOTE_NONE)
ca1 = pd.read_csv(file,sep='|',delimiter='|',quoting=3)
ca1.shape
#if you don't do this, you will get fewer records than expected: 
#the parser sees the single double quote in the field 
#and keeps reading while looking for the 2nd double quote. 

#in case if you got the error message: DtypeWarning: Columns (3,5) have mixed types. 
#Specify dtype option on import or set low_memory=False.
#use the following dtype option to be object: 
df = pd.read_csv("gs://folder/data1.csv",sep='|',delimiter='|',dtype=object)
df = pd.read_csv("gs://folder/data1.csv",compression='gzip',sep='|',delimiter='|',dtype=object)

#when using pd.read_csv, there is an na_values option for nulls. 
#By default the following values are interpreted as NaN: 
#'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
# '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 
# 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

#====================================
 #save a csv file using pandas.to_csv
#====================================
file=r'/Users/****/Record_level_confidence/dataout/ca_output_by_year.csv'
file=r'C:\Directory\Name_of_the_file_output.csv'
pd.DataFrame(dataname_to_output).to_csv(path_or_buf=file, index=False, header=True)
#index=False avoids writing the row index out as a column 

#====================================
#read excel file into python
#====================================                    
import pandas as pd
df1=pd.read_excel("X:/XXX/.../YYYYYY.xlsx",sheet_name='name')
##use forward slashes in the path here; don't use \ or \\ ##

#====================================
 #save results to excel sheets
#====================================
file=r'C:\directory\excel_file_name_you_want.xlsx'
writer = pd.ExcelWriter(file, engine='xlsxwriter')

result[0].to_excel(writer, sheet_name='Sheet1_name')
workbook = writer.book

worksheet = writer.sheets['Sheet1_name']
worksheet.set_zoom(90)
percent_fmt = workbook.add_format({'num_format': '0.000%', 'bold': False})
worksheet.set_column('A:G', 20)
worksheet.set_column('C:G', 20, percent_fmt)
#===========================================================

Question: How do we output/import a pickle file in Python?
Answer: Writing a csv file is the most common way to save your output, but in Python you can also save output using a module called pickle. Pickle is used for serializing and de-serializing a Python object structure.

Why do we need serialization? Serialization lets you read/write arbitrarily complicated objects, automatically or manually, from/to disk or over the network. If you use csv or any other delimited format to transfer the file, you will always run into unexpected troubles. Almost any object in Python can be pickled so that it can be saved on disk.

What pickle does is "serialize" the object before writing it to a file. Pickling converts a Python object (list, dict, etc.) into a byte stream that contains all the information necessary to reconstruct the object in another Python script. The disadvantage of the pickle format is that only Python can read and write it, and it can have compatibility issues between different Python versions.
#================================================
# take user input for the amount of data
import os, pickle
number_of_data = int(input('Enter the number of data : '))
data = []

# take input of the data
for i in range(number_of_data):
    raw = input('Enter data '+str(i)+' : ')
    data.append(raw)

# open a file where you want to store the data
file = open('data_name', 'wb')

cwd = os.getcwd()
print(cwd) #this is the folder where the pickle file will be written
# dump information to that file
pickle.dump(data, file)
# remember to close the file, otherwise trouble
file.close()

#in case you get the error message: 
#OverflowError: cannot serialize a bytes object larger than 4 GiB
#you can use protocol 4 (note the binary mode 'wb'): 
pickle.dump(d, open("file", 'wb'), protocol=4)

cwd = os.getcwd()
cwd1 = os.path.join(cwd,'data_break/')
    
m=0    
for cntycd1 in list_cntycd2:
    m=m+1
    time_start = datetime.datetime.now()
    #temp0=dfa[(dfa.cntycd==cntycd1)]
    temp0=dfb.loc[(dfb.cntycd==cntycd1)]     
    file1=cwd1+'Datain_prep_cnty_'+cntycd1    
    file = open(file1, 'wb')
    cPickle.dump(temp0, file, protocol=4)
    file.close()
    time_end = datetime.datetime.now()
    time_mins=(time_end-time_start)
    print("round", m, " for cnty:", cntycd1, " : seconds taking: ", 
    time_mins, ' seconds of size data: ', temp0.shape[0], ' rows')

#or use the following:
import gzip   #use gzip to compress the file.
file=r'C:\folders\historical_50_major_daily_update_all'
with gzip.open(file, 'wb') as f_model_pkl:
    pickle.dump(test2, f_model_pkl)

import _pickle as cPickle #cPickle is much faster
file1 = cPickle.load(gzip.open(file, 'rb'), encoding="ASCII")     
file1.shape    
file1.head()  
# =============================================================================
# #Python Pickle load
# =============================================================================
# open a file, where you stored the pickled data
file = open('data_name', 'rb') #opening the file in read-binary (rb) mode
file
data = pickle.load(file)
data
# close the file
file.close()   
#===========================================================
The main difference between cPickle and pickle is performance. The cPickle module is many times faster to execute because it's written in C and because its methods are functions instead of classes. (In Python 3, cPickle lives on as _pickle, and the standard pickle module uses that C implementation automatically when it is available.)

Sometimes you might get an error message like this: _pickle.UnpicklingError: invalid load key. A common cause is that the pickle archive was created using gzip but then opened without it. For example, if a compressed pickle archive is made like this,
#===========================================================                     
import gzip, pickle
with gzip.open('test.pklz', 'wb') as ofp:
    pickle.dump([1,2,3], ofp)

with open('test.pklz', 'rb') as ifp:
     print(pickle.load(ifp))
#===========================================================
Trying to open it with a plain open() throws the error: Traceback (most recent call last): File "", line 2, in _pickle.UnpicklingError: invalid load key, ''. But if the pickle file is opened using gzip, the problem goes away:
#=========================================================== 
with gzip.open('test.pklz', 'rb') as ifp:
    print(pickle.load(ifp))                                             
#===========================================================

Question: How do we create an empty dataframe and then fill in values?
Sometimes we want to create an empty dataframe and then assign values/rows/data into it, like the following code, but the output is surprising:
#================================================
df1=pd.DataFrame(columns=['var1'])
df1['var1']='something1'
print(df1)  #the output is nothing, empty!
#===========================================================
You need at least one actual row before you can start assigning more values/rows into that dataframe; here is one way to do it:
#================================================
#need to have at least one row!
all0=pd.DataFrame(columns=('state','cntycd'),index=range(1))
all0=pd.DataFrame(index=range(1))
all0=all0[0:0] 

#or df1=pd.DataFrame({'var1':['something1']})  
# {} is for the key-value pair, [] for the list of values.
#or df1=pd.DataFrame({'var1':['something1'],'varz':['so******']})
df1['var1']='something2'
print(df1) 
#===========================================================
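A cleaner pattern for building a dataframe row by row (a sketch of my own, not from the original notes) is to collect the rows in a plain list and construct the DataFrame once at the end, which avoids the empty-dataframe quirk entirely:
#================================================
import pandas as pd

rows = []
for i in range(3):
    rows.append({'state': 'state_' + str(i), 'cntycd': i})  #one dict per row
df1 = pd.DataFrame(rows)   #build the frame once, after the loop
print(df1)
#================================================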

Question: How do we subset dataframe in python?
#================================================
# Best method: select via column number========
#very convenient to use iloc to subset data: ====
#select the ith column: Python uses 0-based indexing
all0.iloc[:,i-1]

# select rows of 0 and 10 (not from 0 to 10)
test.loc[[0, 10], :]

#select 4 columns in different order
test1=test.iloc[:,[0,1,-1,6]]

#To select rows whose column value is in list 
years = [1952, 2007]
gapminder.year.isin(years)

#select unique values
gapminder_years.year.unique()
array([1952, 2007])

#select the ones not from the list of values
continents = ['Asia','Africa', 'Americas', 'Europe']
gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)]
gapminder[~gapminder.continent.isin(continents) & 
           gapminder.year.isin(years)]
           
df0=test[['date2','open','low','high','close','volume','symbol']][test.symbol=='LABD']

#drop columns B and C; the two calls below are equivalent
df.drop(['B', 'C'], axis=1)
df.drop(columns=['B', 'C'])
#axis=1 for columns, axis=0 for rows

#rename one column name
all2.rename(columns={ "off-white": "off_white"}, inplace=True)

#change data types, using astype() or to_numeric()
df['b'].astype(float)
df['b'].astype(str)
#in case if some error: "could not convert string to float", use  to_numeric()
df['b'] = pd.to_numeric(df['b'], errors='coerce')

#in case you want to convert to integer: 
df['b'] = pd.to_numeric(df['b'], errors='coerce', downcast='integer')
#downcast : {'integer', 'signed', 'unsigned', 'float'}, default None

all1['from2a']=all1['from1'].str.find("<")+2
all1['from2b']=all1['from1'].str.find(">")+1
all1['to2a']=all1['to1'].str.find("<")+2
all1['to2b']=all1['to1'].str.find(">")+1
all1['from2']=all1['from1'].str[pd.to_numeric(all1['from2a'], 
errors='coerce'):pd.to_numeric(all1['from2b'], errors='coerce')]
all1['from2']=all1['from1'].str[18:50]

#rename all column names
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(index=str, columns={"A": "a", "B": "c"})

# select rows 0, 1, 2 (row 3 is not selected)
test[0:3]
#test[starting_num-1:num_of_records]
 
# Method 1: select a 'subset' of the data using the column name
test[['var1', 'var2']]

# Method 2: use the column name as an 'attribute'
test.species_id

# select the first 5 rows (rows 0, 1, 2, 3, 4)
test[:5]

# select the last element in the list
# (the slice starts at the last element,
# and ends at the end of the list)
test[-1:]   

#create dummies via where statement
test1['new_var']=np.where(test['var_check']>=5,1,0)

data1[data1['col2'].isna()]
data1[data1['col1'].notna()]

#stack 2 data frame together
pd.concat([df1, df2])
pd.concat([df1, df2]).drop_duplicates()

#remove duplicated records from test2 
test2.drop_duplicates(subset=['symbol','date1','volume'], 
keep='first', inplace=True)

#subset dataframe via muitiple filters
test2=test[(test['var1_check']==1) 
         & (test['saledate']<=holdout + timedelta(days=7*i))]
test1=test1.rename(columns={"old_var_name":"new_var_name"})

##define columns name for a data frame
df.columns = ['a', 'b']

##create an empty dataframe, then add rows using append## 
ref_data=pd.DataFrame(columns=['Coefficient','Variable'])
ref_data=ref_data.append({'Coefficient':0.0, 'Variable': 
"C(var1, Treatment(reference=ref_var))[T."+str(ref_var)+"]"}, 
ignore_index=True)
                   
f=pd.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)  

#join two lists of strings together:
mergedlist=list(vars3.var2)+["price_per_sq_imp"]
mergedlist

#remove something from list of items
list_col0.remove(list1)
#in case you get an error like: unsupported operand type(s) for -: 'list' and 'list', use set(): 
list_col1=set(list_col0)-set(list1)

mergedlist = []
mergedlist.extend(list(vars3.var2))
mergedlist.extend(["price_per_sq_imp"])
mergedlist 

df1=df.sample(n=3, random_state=12345) #sample of 3 records
df2=df.sample(frac=0.5, replace=True, random_state=1) #sample of 50% of records

#Using a DataFrame column as weights: rows with a larger value in the num_specimen_seen column are more likely to be sampled.
df3=df.sample(n=2, weights='num_specimen_seen', random_state=12345)
#================================================
Some of the tutorials are referenced from here.
#================================================
 data1.categorical_var1.value_counts()
 data1.numerical_var1.describe()
 data1.numerical_var1.isnull().sum()
 print(" dataset has {} data points with {} variables each.".format(*data0.shape))
 data0=shf[(shf['var1'].notnull()) | (shf['var2'].notnull())  ]
 #use isnull() or notnull() to select
#================================================
 
Some examples of handling string operations:
#================================================
####====Find the substring from a string====####
data1=test1[test1['Var1'].str.contains('some',regex=False)]
#regex=True, assumes the pat is a regular expression.
#regex=False, treats the pat as a literal string.
##Most of the time just use regex=False, since we care about the literal string.##

#find the position of some (special) character
all1['from2a']=all1['from1'].str.find("<")+2
all1['from2b']=all1['from1'].str.find(">")+1
t1c=pysql(""" select *,substr(from1,from2a,from2b-from2a) as from2 from all1 as a """)
#to get the email address.

#get a substring (the first 9 characters) from a column in a dataframe
df['New_var'] = df['var'].str.slice(0,9)
t1['sale_yr']=t1['saledt'].astype(str).str.slice(0,4)

s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.NaN])
s1.str.contains('oG', case=True, regex=False)
0    False
1    False
2    False
3    False
4      NaN
 
#Returning any digit using regular expression.
s1.str.contains('\d', regex=True)

#return anything followed by 0
s2 = pd.Series(['40','40.0','41','41.0','35'])
s2.str.contains('.0', regex=True)
0     True
1     True
2    False
3     True
4    False

##== we can also use extract ==## 
s1 = pd.Series(['a1', 'b2', 'c3'])
s1.str.extract(r'(?P<letter>[ab])(?P<digit>\d)') 

s2 = Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
s2.str.extractall(r"(?P[ab])(?P\d)")

date=datetime.today()
pdffile=r'C:\Users\...\Python_test_'+str(date)[:10]+'.pdf'
#use + (or +=) to join 2 strings together

#CONCAT(string1,string2) only for 2 strings in sql, not in python##
CONCAT(CONCAT(string1,string2),string3) as string_new
CONCAT(string1,nvl(string2, 0)) as string_new

#substring(str1,start_posi(1,2,...),#_of_chars) as str_new in sql, not in python
substring(string1,5,3) as string_new

 x = "Hello World!"
 x[2:]
'llo World!'
 x[:2]
'He'
 x[:-2]
'Hello Worl'
 x[-2:]
'd!'
 x[2:-2]
'llo Worl'
#get the reverse/backward  of string: 
some_string[::-1]
#select every other letter
"H-e-l-l-o- -W-o-r-l-d"[::2] # outputs "Hello World"

##use .join function to join a few strings##
music = ["Metallica", "Rolling Stones", "ACDC"]
"_".join(music)  

test='asdgasdg H asdfghHHHHLLL H '
test.lower()        
test #test will not change
test1=test.replace(" H "," K ")        
test1
test2=test.find('HH')
    
###########################################################    
#comparing the three methods by speed: 
#string1.replace(old_string, new_string) > re.sub > translate.
#translate is slower than re.sub (with precompilation) by about 10 times, 
#and str.replace is faster than re.sub by about 3 times.
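#A quick way to check the speed claim above yourself (my own snippet),
#timing str.replace against re.sub on a toy string:
import re, timeit
text = 'asdgasdg H asdfghHHHHLLL H ' * 100
t_replace = timeit.timeit(lambda: text.replace(' H ', ' K '), number=10000)
t_resub   = timeit.timeit(lambda: re.sub(' H ', ' K ', text), number=10000)
print(t_replace, t_resub)   #str.replace is usually the faster of the two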
#using the re package to remove/replace punctuation  
import re
def remove_punct(text):
    new_words = []
    for word in text:
        w = re.sub(r'[^\w\s]','',word) #remove everything except words and space 
                                        
        w = re.sub(r'\_','',w)  #to remove underscore as well
        new_words.append(w)
    return new_words
  
##########################################################
new_data=[]
with open('/home/rahul/align.txt','r') as f:
    f1 = f.read()
    f2 = f1.split()
    all_words = f2 
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~''' 
    # You can add and remove punctuations as per your choice 
    #removing stop words in hungarian text and  english text and 
    #display the unpunctuated string
    # To remove from a string, replace new_data with new_str 
    # new_str = "My name$#@ is . rahul -~"

    for word in all_words: 
        if word not in punctuations:
           new_data.append(word)

    print (new_data)
      
#use the following to find out details about special characters      
#https://docs.python.org/3.1/library/re.html
#remove all the emails address
test1='sadhiong  sdhtihg@asdhilg.com  sdhi  asdg'
test2=re.sub('\S*@\S*\s?', '', test1) 

#here is some good way to replace substring from a list of strings:
#here is a good way to replace a substring across a list of strings:
words = [w.replace('[br]', '<br />') for w in words]

# Convert to list
data = df.content.values.tolist()
# Remove Emails
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
#\s is for any Unicode whitespace characters, which includes [ \t\n\r\f\v].
#\S is the opposite to \s: for any Non-(Unicode whitespace characters)
#\w: any Unicode word characters: includes most characters that can be part of a word 
#in any language, as well as numbers and the underscore.
#\W: the opposite of \w
  
# Remove new line characters
data = [re.sub('\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("\'", "", sent) for sent in data]
#================================================
 
Question: How do we apply map/lambda functions in Python? Examples of loops in Python?
#================================================
xs = [(x+1) * 0.01 for x in range(9)]
xs

[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]

##first load data from AWS S3 bucket
return_df = pd.read_csv('s3://{bucket}/{key}'.format(bucket='bucketname', 
key='the file name'), 
compression='infer', header=0, sep=',', quotechar='"', dtype=dtype)

dfb=dfa.set_index(keys='cntycd',drop=False,append = False)

data1.reset_index('cntycd',drop=False,inplace=True) 
#set the index of dataframe to be some column 
#drop is to drop that column if True, inplace is to replace the data

return_df.reset_index()

return_df.reset_index(drop=True) 
#drop=True to avoid the previous index to be added as column 
Series.reset_index(level=None, drop=False, name=None, inplace=False)
#reset index for the series, return dataframe if drop=False
return_df.sort_values('var1', inplace=True)  #with inplace=True, do not re-assign: the call returns None
return_df = return_df.sort_values(by=['var1', 'var2'],ascending=[True,True])

data.drop_duplicates(subset ="First Name",  #by default, use all the columns.
                     keep = False/first/last, inplace = True) 
#first: drop duplicates except for the first occurrence.   

return_df = return_df.set_index(['var1', 'var2'])
c = mileage_coef_lookup.columns.tolist()
##apply: map(function, iterable): the iterable will be the list c.
return_df.columns = list(map(lambda x: x + '_function', c))
#the output looks like: 'area_function','color_function' ##

#map(fun, iter): map() function returns a list of the results after applying 
#the given function to each item of a given iterable (list, tuple etc.)
def addition(n): 
    return n + n 
  
# We double all numbers using map() 
numbers = (1, 2, 3, 4) 
result = map(addition, numbers) 
print(list(result)) 
#output: [2, 4, 6, 8]

numbers = (1, 2, 3, 4) 
result = map(lambda x: x + x, numbers) 
print(list(result)) 

numbers1 = [1, 2, 3] 
numbers2 = [4, 5, 6] 
result = map(lambda x, y: x + y, numbers1, numbers2) 
print(list(result)) 

#================================================
map applies a function to every element of the source (in Python 3 it returns an iterator, so wrap it in list() to get a list):
#================================================
xs = [1, 2, 3]

# all of these produce [2, 4, 6]
# 1. map (wrapped in list() since map returns an iterator in Python 3)
ys = list(map(lambda x: x * 2, xs))
# 2. list comprehension
ys = [x * 2 for x in xs]
# 3. explicit loop
ys = []
for x in xs:
    ys.append(x * 2)
#================================================
 
Here is a loop over (0, i-1) and (i+1, n), joined together as one iterable via itertools.chain.
#================================================
import itertools
cnt=0
for i in range(len(list1)-1):
    for j in itertools.chain(range(0,i), range(i+1,len(list1))):
#    for j in range(i+1,len(list1)) & (j!=i):
        cnt=cnt+1
        print('running for ', cnt, ' ',list1[i],'  --> ',list1[j])
        var1=list1[i]
        var2=list1[j]
list(range(2,8,3)) #output [2, 5]; the ending number 8 is not included 
#================================================

Question: How do we reshape/transpose/pivot data in Python?
#================================================
#pivot/transform data from long (vertical) to wide (horizontal)
data2=data1.pivot_table(index=['by_var1','by_var2'], 
columns='var_name', values='var_value')
#the values argument is not strictly necessary.
#after applying it, data2 will have a MultiIndex; call reset_index afterwards.
data2.reset_index(inplace=True)
##we can add aggfunc(mean/sum), default: numpy.mean.

#in case you get the error: DataError: No numeric types to aggregate, 
#use aggfunc='first'
test1b=test1a.pivot_table(index=[var1,var2,var3],columns='focus',
values='focus_value',aggfunc='first')

#or just use pivot for single index#
data2=data1.pivot(index='by_var1', 
columns='var_name', values='var_value')

#or use group by and sum/mean functions#
df2=df.groupby('weekday').mean() #or this following
df2=df.groupby('weekday').agg(np.mean)
df2=df.groupby('weekday').agg(np.median)
df2=df.groupby(['symbol', 'weekday']).mean()
data2=data1.groupby(level=1, axis=1).mean()

sh1a=(sh.groupby([var2,var3],as_index=False).agg({"clip": 'count'})
        .sort_values(['clip'],ascending=False)) 
#================================================
 
To get the data back from wide to long, we can use the melt function:
#================================================
 new_data=pd.melt(df1, id_vars='var_as_by_category', 
 value_vars=[2010, 2011, 2012]) ##select the columns to unpivot
#================================================
  
Question: How do we drop a big dataframe and release memory in Python?
Answer: After we have run a series of programs, there may be quite a few dataframes/series left in memory, some of them huge. Dropping those dataframes and releasing the memory back to the system, so the limited memory can be recycled, is a tricky technique we need to pay attention to when dealing with large datasets.
#===========================================================         
##check the size of the data frame in terms of MB##
import sys
sys.getsizeof(data1)
##output is in bytes; convert to MB via:##
print('The size of the file is ',
round(sys.getsizeof(data1)/(1024.0*1024.0),1), ' MB')

#code to list all dataframe in memory
%whos DataFrame 

data1.info()
data1.info(memory_usage='deep')

##2nd way of analyzing dataframe memory usage
data1.memory_usage(deep=True)/(1024.0*1024.0)
np.sum(data1.memory_usage(deep=True)/(1024.0*1024.0))
##the sum should be equal to the  sys.getsizeof()##

##Technically memory is about this ##
df.values.nbytes + df.index.nbytes + df.columns.nbytes

#here is how to delete datasets and release memory.
import gc  ##for garbage collection##
del [[df_1,df_2]]
gc.collect()
df_1=pd.DataFrame()
df_2=pd.DataFrame()
 
##just running del data1 might not release the memory##
##put the frames in a list; deleting the list will trigger garbage collection##
lst1 = [df1, df2]
del lst1 
#===========================================================

Question: How do we save the log to a file in Python?
import os
cwd = os.getcwd()
print(cwd)

import logging
import logging.handlers

log = logging.getLogger()
log.setLevel(logging.INFO)

fh = logging.FileHandler(filename='jk_test_log2.txt')
fh.setLevel(logging.INFO)
fh.setFormatter(logging.Formatter(
                fmt='%(asctime)s %(levelname)s: %(message)s',
                datefmt='%Y-%m-%d %H:%M:%S'
                ))
log.addHandler(fh)

log.info('-------Start--------')
log.info('this function is doing something')
log.info('this function is finished')
log.removeHandler(fh) 
del log,fh

Question: How do we do some quick summary statistics in python?
Answer: we can use the following functions to get quick summary statistics for a dataframe.
#================================================
 ##==check the frequency of the variable==##
 data1.categorical_var1.value_counts(dropna=True)
 
 ##==check the unique value of  variable==##
 data1.numerical_var1.unique()
 
 ##==check the distribution of the variable==##
 data1.numerical_var1.describe()
 
 ##==check the count of null of the variable==##
 data1.numerical_var1.isnull().sum()
 print(" dataset has {} data points with {} variables.".format(*data0.shape))
 
 ##example of rolling average, min, max, cumulative sum
df['L9'] = df['low'].rolling(window=9).min()
df['H9'] = df['high'].rolling(window=9).max()
df['RSV'] = 100*((df['close'] - df['L9']) 
               / (df['H9'] - df['L9']) )
df['k'] = df['RSV'].rolling(window=3).mean()
df['d'] = df['k'].rolling(window=3).mean()
df['Cumsum_spy']=df['year_ret'].cumsum()

#maximum and minimum of "X" and "Y" columns (for each row)
df[["X", "Y"]].max(axis=1)
df[["X", "Y"]].min(axis=1)
data1['max1'] = np.maximum.reduce(data1.iloc[:,[-1,-2,-3,-4]].values, axis=1) 
#max of 4 columns.
data1['max1'] = np.maximum.reduce(data1.iloc[:,list(range(9,140,2))].values, axis=1) 
#max of more columns, starting with column 9, then every other column
data1a=data1.iloc[:,[0,1,2,3,4,5,-1,-2,-3,-4,-5]] #rearrange/reorder the columns
df['var_new']=df['var'].clip(min1,max1)
df['var_new']=df['var'].clip(upper=max1)  #clip_upper/clip_lower were removed in newer pandas;
df['var_new']=df['var'].clip(lower=min1)  #use clip(upper=...) / clip(lower=...) instead
#================================================
Notice that all those functions, value_counts(), unique(), isnull(), notnull(), sum(), are called with parentheses. In fact, we can define a function to run a few summary statistics for any dataset:
#================================================
 def check(data0):
    print('**********************************************************')
    print(' Num of records: {}  and  Num of variables: {} '.format(*data0.shape))
    print('\n Num of Null records: \n', data0.isnull().sum())
    print('\n List of variables: \n', list(data0))
    print('\n Variable data types: \n ', list(data0.dtypes))    
    print('**********************************************************')
    print('\n  The first 5 records: \n ', data0.head())
    print('**********************************************************')
    print('\n  The last 5 records: \n ', data0.tail())
    print('**********************************************************')
    print('\n  Variable distribution: \n ', data0.describe())
        
    return 
 check(data)   
#================================================

Python marks code blocks by indentation, so if you have written some lines of code without any indent and later decide to indent them further, for example to wrap them in a function definition, you need to indent those lines together. It could be hundreds of lines of code; do you want to indent line by line? Definitely not a good choice. You can highlight/select the lines you want indented, then press TAB as often as needed until they reach the proper indent level. You can remove spaces with SHIFT+TAB.

You can also use CTRL+ALT+I to auto-indent the selection. If you're using IDLE (or a Python notebook/.py file), you can use Ctrl+] to indent and Ctrl+[ to unindent. (On a MacBook, use Command+[ to unindent.)

Question: How do we load R datasets/objects into Python?

#===========================================================
#===You need to install and use rpy2 package===============
#===rpy2 is an interface to R running embedded in Python.===
#===use the following code to install via the anaconda prompt===
conda install -c r rpy2 
#===pip install rpy2 might not work sometimes===
import rpy2
print(rpy2.__version__)
#===========================================================
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
#===========================================================
readRDS = robjects.r['readRDS']
df = readRDS('my_file.rds')
df = pandas2ri.ri2py(df)
#===========================================================

Question: What happens if you see the error message ValueError: x and y must be the same size, or ValueError: x and y must have same first dimension, when you try to plot in Python?

Answer: One of the challenges for Python users is that there are so many data formats: array, list, series, data frame, etc. We need to take extra caution here.
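For example (a quick sketch of my own), the same three numbers can live in any of these containers, and each behaves a little differently:
import numpy as np
import pandas as pd
x = [1, 2, 3]                  # list
a = np.array(x)                # numpy array
s = pd.Series(x)               # pandas Series
df = pd.DataFrame({'x': x})    # pandas DataFrame
print(type(x), type(a), type(s), type(df))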

When we try to plot or scatter plot:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,8))
plt.scatter(data1.var_x,data1.var_y,color='green')
plt.show()


One direct check for ValueError: x and y must be the same size is simply to find out the dimensions of x and y:
print(data1.var_x.shape)
print(data1.var_y.shape)

to see if they are the same.

One of my funny but painful experiences is the following: I used pysql to join 2 data frames where some of the variables share the same name in both data frames, and I used a select a.*, b.* statement to get all the variables.

For example, if var_y shows up in both data frame 1 and data frame 2, there will be two columns in the new data frame with the same name var_y. Now if you try to plot:
plt.scatter(data1.var_x,data1.var_y,color='green')
you will most likely get the error above, because you have 2 columns named var_y while you only have one column var_x. To take care of the issue, the best practice is to rename them to different variable names (or drop the duplicated columns; see the snippet below).

Similarly, for the error message ValueError: x and y must have same first dimension, we can use len(x) and len(y) to check the dimensions.
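One quick guard against the duplicated-column situation described above (a sketch, assuming data1 is the joined data frame) is to drop repeated column names before plotting:
data1 = data1.loc[:, ~data1.columns.duplicated()]  #keep only the first of any repeated column name
print(data1.var_y.shape)  #now a single column again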

Question: Say we have a list of dataframes, i.e. each element in the list is a dataframe with the same variable names; how do we combine them into one dataframe?
#===========================================================
global New_dataframe
New_dataframe=pd.DataFrame()
for i in range(len(Old_list)):
    New_dataframe=New_dataframe.append(Old_list[i])
New_dataframe.shape    
New_dataframe.head()
#===========================================================
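An equivalent one-liner (a sketch; pd.concat accepts the list of dataframes directly):
#===========================================================
New_dataframe = pd.concat(Old_list, ignore_index=True)
New_dataframe.shape
#===========================================================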
Question: How do we download data from amazon redshift?
Answer: we use the sqlalchemy package to download data from Redshift. SQLAlchemy is a SQL toolkit released under the MIT license.

First, you need to install something very specific: sqlalchemy-redshift (not pandas_redshift) via the Anaconda prompt: pip install sqlalchemy-redshift. Otherwise, you might get an error message like the following: "NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:redshift.psycopg2".

After you install that package, restart your Python kernel/Spyder session and the error message should go away. If you are using a Jupyter notebook, use this to install: !pip install sqlalchemy-redshift
#===============================================================
import sqlalchemy as sa
import pandas as pd
#import sqlalchemy_redshift  
#import psycopg2

engine_setup = sa.create_engine(
    'redshift+psycopg2://{user}:{password}@{host}:{port}/{dbname}'.format(
                            dbname=DATABASE, 
                            user=USER, 
                            password=PASSWORD,
                            port=PORT,
                            host=HOST),
                    encoding='utf-8',
                    connect_args=
{"keepalives": 1, "keepalives_idle": 60, "keepalives_interval": 60} 
                        )
                       
sql_query = """select var1,var2 from data0 where var_filter = '****' """
data_in=pd.read_sql_query(sql_query, con=engine_setup)
data_in.shape
#===============================================================
Here is an example of creating a table inside Redshift, distributing the data evenly.
#===============================================================
eng = sa.create_engine(
  'redshift+psycopg2://{user}:{password}@{host}:{port}/{dbname}'.format(
  dbname=DATABASE, 
  user=USER, 
  password=PASSWORD,
  port=PORT,
  host=HOST),
# execution_options={'autocommit': True},
  encoding='utf-8',
  connect_args={"keepalives": 1, "keepalives_idle": 60,
   "keepalives_interval": 60}  
#   isolation_level="AUTOCOMMIT"
                          )
eng.execute('DROP TABLE IF EXISTS dataset1')                         
eng.execute("""create table dataset1 
    DISTSTYLE EVEN as
    select var1,
    dateadd(day,-365,current_date) as ref_date,
    count(id) over (partition by id 
    order by sale_date desc rows unbounded preceding) as rank
    --create a rank variable for each transaction with the same id
    from dataset0
    where var_filter = '{var_used}'
      """.format(var_used=****) ) 
#===============================================================
If you have created a table/dataset inside Redshift, other people might have permission issues accessing it; here is the code you can use to grant access:
#===============================================================
##grant access to all tables from schema "public" to someone##
##it might stop running due to one error message!## 
grant select on all tables in schema public to some_other_id

##you can use the following to create the grant list##
##this is extremely useful in permission issue###
select 'grant select on table projectz.public.'||tablename
||'  to skasdf;' from pg_tables where tableowner='kasdli';

##to solve the permission issue from the root: 
  ;alter default privileges for user sk****, principal
    in schema public 
    grant all privileges on tables 
    to group user_group_name1;

##grant access to some table in some database to  someone##
select current_user;
grant all on all tables in schema public to PUBLIC
grant select on table database.public.some_other_tabe  to some_other_id

##in aginity redshift if you want to see the table list:
select *  from pg_tables
select *  from PG_TABLE_DEF

##in aginity redshift if you want to see the column list:
select * from pg_table_def
where table_name='uasdfv_uadt_ddsat9';

##or use this one:
select * from information_schema.columns 
where table_name='uasdfv_ddsat9';

##use the following code to the size of each dataset##
SELECT name AS table_name, 
ROUND((COUNT(*) / 1024.0),2) as "Size in GB"
FROM stv_blocklist
INNER JOIN
(SELECT DISTINCT id, name FROM stv_tbl_perm) names
ON names.id = stv_blocklist.tbl
GROUP BY name
ORDER BY "Size in GB" DESC

##in aginity netezza if you want to see the table list:
select * from _v_table_storage_stat

#rename the table: 
alter table old_name
rename to new_dataset;

#drop a column: 
alter table data1
drop column var1;

#rename a column: 
alter table data1
rename column old_name to new_name;

#change data type
cast(old_column as real) as new_name

####change owner of table
alter table table_name
owner to new_owner;
#===============================================================

Question: How do we replace a substring via SQL in Redshift?
Use the REPLACE function to replace one string with another; for example,
if I want to replace a double space with a single space, I can use the following:
#===============================================================
select var1,REPLACE(var1, '  ', ' ') as var2,count(*) as cnt
 from data1 
 where REGEXP_REPLACE(var1, '[^0-9a-zA-Z]+', '_')='Oldsmobile_Silhouette' 
 group by 1,2
 order by 1,3 desc
###replace "Shows" in the string with "Theatre"
 replace(catgroup, 'Shows', 'Theatre')
#===============================================================
#substitute any character that is non 0-9a-zA-Z by a single "_"
select REGEXP_REPLACE(vehicle_map, '[^0-9a-zA-Z]+', '_') 
##deletes the @ and domain name from email addresses
select regexp_replace( email, '@.*\\.(org|gov|com)$') 
##replaces the domain names with internal.com
regexp_replace(url, '^.*\\.[[:alpha:]]{3}/', 'internal.com/')
##e.g.example.com/cuisine/locations/home.html changed to: 
internal.com/cuisine/locations/home.html
##substitute with the corresp. letters:
select translate('mint tea', 'inea', 'osin');
##will output to: most tin
#===============================================================

REPLACE is similar to the TRANSLATE Function and the REGEXP_REPLACE Function, except that TRANSLATE makes multiple single-character substitutions and REGEXP_REPLACE lets you search a string for a regular expression pattern, while REPLACE substitutes one entire string with another string.
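
For intuition only, here is a rough Python analogy of the three functions (this is not how Redshift implements them): str.replace behaves like REPLACE, str.translate like TRANSLATE, and re.sub like REGEXP_REPLACE.
#===============================================================
import re

s = "mint tea"

# REPLACE: swap one whole substring for another
print(s.replace("tea", "coffee"))                    # mint coffee

# TRANSLATE: multiple single-character substitutions (i->o, n->s, e->i, a->n)
print(s.translate(str.maketrans("inea", "osin")))    # most tin

# REGEXP_REPLACE: pattern-based substitution
print(re.sub(r"[^0-9a-zA-Z]+", "_", "Oldsmobile Silhouette!"))  # Oldsmobile_Silhouette_
#===============================================================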

If you need to upload data from an S3 bucket in AWS, you first need to create the schema/template for the data in Redshift. Here is the code to generate the schema; you can run it in Aginity for Redshift, or use Aginity ==> Script ==> DDL to query window:
#===============================================================
drop table IF EXISTS public.dataset1;
CREATE TABLE public.dataset1
(
	tran_id BIGINT ENCODE mostly32,
	vin VARCHAR(256) ENCODE lzo,
	sold_dt VARCHAR(256) ENCODE lzo,
	drivetrain VARCHAR(65535) ENCODE lzo
	--if you are not sure, use VARCHAR(256) for everything except really long strings
)
DISTSTYLE EVEN;
#===============================================================
Then you can run the following code in two different environments. 1. Continue in the Aginity Redshift environment; this is much faster:
#===============================================================
delete from data_simple;
copy data_simple
from 's3://S3_bucket/Transaction_data_simple.csv' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::023456778891:role/RedshifttoS3'
csv 
IGNOREHEADER 1      -- handle the csv file header
gzip                -- handle a gzip-compressed csv file
IGNOREBLANKLINES    -- handle the blank-lines error
FILLRECORD          -- autofill the empty fields in the data
DELIMITER ','
NULL AS 'NULL' 
DATEFORMAT as 'YYYYMMDD';
#===============================================================
2. In the Python environment (either Spyder or JupyterLab), upload the csv or manifest file from S3 to Redshift; this tends to be slower and more error-prone.
#===============================================================
import s3fs
import sqlalchemy as sa

fs = s3fs.S3FileSystem()
eng = sa.create_engine('redshift+psycopg2://{user}:{password}'
                       '@{host}:{port}/{dbname}'.format(
  dbname=DATABASE, 
  user=USER, 
  password=PASSWORD,
  port=PORT,
  host=HOST),
  execution_options={'autocommit': True},
  encoding='utf-8',
  connect_args={"keepalives": 1, "keepalives_idle": 60,
   "keepalives_interval": 60}  
#   isolation_level="AUTOCOMMIT"
                          )

#example: uploading csv data
#better to run this in the Spyder environment; in SageMaker/JupyterLab it might not work
                          
eng.execute("""delete from dataname1""")
eng.execute("""copy dataname1
from 's3://S3_bucketlocation/data1.csv' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::#######:role/RedshifttoS3'
csv 
IGNOREHEADER 1
IGNOREBLANKLINES
FILLRECORD
DELIMITER ','
NULL AS 'NULL'
GZIP -- for a compressed file in .gz format 
DATEFORMAT as 'YYYYMMDD';
""")                         
            
#example uploading manifest data                          
eng.execute('DROP TABLE IF EXISTS dataset1')                         
eng.execute("""copy unified_transactions_ext
from 's3://S3_bucketlocation/data_extmanifest' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::1234567:role/RedshifttoS3'
MANIFEST 
DELIMITER '|' 
NULL AS 'NULL' 
DATEFORMAT as 'YYYYMMDD'
ESCAPE;
""") 
#===============================================================
Here is a wrap-up of uploading data onto Redshift: 1. from a Python dataframe, 2. from a csv file in an S3 bucket.
#===============================================================
import psycopg2
from psycopg2.errors import *
import pandas_redshift as pr
import pandas as pd
from boto3 import client

class UploadToRedshift:
    def __init__(self, schema = "public", dbname = "projectz", port = "5439", user = "asdhgig", \
                db_password = "123456789", db_url = "**.us-east-1.redshift.amazonaws.com",\
                aws_access_key_id = "*******", aws_secret_access_key = '*******',\
                transfer_bucket = "data1", 
        aws_iam_role = "arn:aws:iam::*******:role/RedshifttoS3"):
        self.schema = schema
        self.dbname = dbname
        self.port = port
        self.user = user
        # redshift password
        self.password = db_password
        self.host_url = db_url
        # aws iam role which have access to 
        self.aws_iam_role = aws_iam_role
        # Long term aws_access_keys, only work for s3 bucket uv-prod-batch
        self.aws_access_key_id = aws_access_key_id
        self.aws_secret_access_key = aws_secret_access_key
        # Intermediate bucket. Only files in S3, EMR, or DynamoDB can be loaded to Redshift,
        # so we need this bucket as a transfer hub.
        self.transfer_bucket = transfer_bucket
        
    
    def upload_from_s3_to_redshift(self, table_name, file_path,  newtable = False, 
                          query_gen = "", delimiter = ",",  skip_header = 1):
        """Query_gen is used to create new table before upload"""
        
        '''This method will unload redshift table into S3'''
        
        conn_string = "dbname='{}' port='{}' user='{}' password='{}' host='{}'"\
            .format(self.dbname, self.port, self.user, self.password, self.host_url)
        
        try:
            con = psycopg2.connect(conn_string)
            print("Connection Successful!")
        except Exception as e:
            print("Unable to connect to Redshift: {}".format(e))
            return

        cur = con.cursor()
        
        # If it's not in redshift, need to create one before upload to redshift
        if newtable:
            try:
                cur.execute(query_gen)
                print("NEW table created")
            except Exception as e:
                print(e)
                print("Failed to create new table")
                con.close()
                return

        query_copy = """copy {}.{} 
                        from '{}'
                        iam_role '{}'
                        DELIMITER '{}' 
                        IGNOREHEADER {} 
                        ACCEPTINVCHARS EMPTYASNULL ESCAPE COMPUPDATE OFF;commit;"""\
                        .format(self.schema, table_name, file_path, \
                                self.aws_iam_role, delimiter, skip_header)

        try:
            cur.execute(query_copy)
            print("File {} was uploaded".format(file_path.split("/")[-1]))
        except InternalError_ as e:
            print("Create table " + table_name + " before copying to it")
        except Exception as e:
            print(e)
        con.close() 


    def pd_dtype_to_redshift_dtype(self, dtype):
        # Modified pd_dtype_to_redshift_dtype in pr package to avoid overflow.
        if dtype.startswith('int64'):
            return 'BIGINT'
        elif dtype.startswith('int'):
            return 'INTEGER'
        elif dtype.startswith('float'):
            return 'FLOAT'
        elif dtype.startswith('datetime'):
            return 'TIMESTAMP'
        elif dtype == 'bool':
            return 'BOOLEAN'
        else:
            return 'VARCHAR(256)'

    def upload_pandas_df_to_redshift(self, dataframe, table_name):
        pr.connect_to_redshift(dbname = self.dbname, 
                                user = self.user, 
                                password = self.password, 
                                port = self.port, 
                                host = self.host_url)
        pr.connect_to_s3(aws_access_key_id= self.aws_access_key_id, 
                     aws_secret_access_key= self.aws_secret_access_key, 
                     bucket = self.transfer_bucket, 
                     aws_iam_role = self.aws_iam_role)

        new_dtypes = []
        for dtype in dataframe.dtypes:
            new_dtypes.append(self.pd_dtype_to_redshift_dtype(str(dtype)))

        try:
            pr.pandas_to_redshift(data_frame = dataframe, redshift_table_name = table_name, 
                                 save_local= False, column_data_types = new_dtypes)
            print("Dataframe was uploaded to redshift as {}".format(table_name))
            
            conn = client('s3')  
            for key in conn.list_objects(Bucket= self.transfer_bucket)['Contents']:
                if key["Key"].startswith(table_name):
                    conn.delete_object(Bucket = self.transfer_bucket, Key = key['Key'])
            print("Temp file in S3 has been deleted")
        except Exception as e:
            print("ERROR: {}".format(str(e)))

def data_to_s3_csv(s3_location, df):
    
    fs = s3fs.S3FileSystem()
    
    with fs.open(s3_location, "wb") as f:
        bt = df.to_csv(None, header=True, index=False).encode();
        f.write(bt)
                  
import importlib
upload_redshift = importlib.import_module('.upload_to_redshift', package='dsv-aws-common')
upload_redshift1 = importlib.import_module('dsv-aws-common.upload_to_redshift')
#upload_redshift is the same as  upload_redshift1
UploadToRedshift=upload_redshift.UploadToRedshift

import os
import pandas as pd
#data = pd.read_csv(file1)
upload1 = UploadToRedshift()
upload1.upload_pandas_df_to_redshift(dataframe=all0, table_name = "****_scoring_avg_all")

file1 = os.path.join('s3://uv-prod-****/a_uv_****/output/', '****_scoring_avg_all.csv')
data_to_s3_csv(file1,all0)

file_path = "s3://s3_bucket_name/data_name.csv"
table = "daily_avg_test"

sql_create = """CREATE TABLE IF NOT EXISTS {}(
                MID BIGINT,
                EDITIONDATE INT,
                AuctionAvgMile FLOAT,
                StdDevMileage FLOAT,
                AvgInsightCond FLOAT,
                StdDevPrice FLOAT,
                BasePrice FLOAT);""".format(table)

upload1.upload_from_s3_to_redshift(table_name = table, file_path = file_path,
                                  newtable = True, 
                                  query_gen = sql_create, 
                                  delimiter = ",",  
                                  skip_header = 1)

#===============================================================
Question: How do we download data from IBM netezza data warehouse?
#===============================================================
import pandas as pd
import pyodbc 
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
         "PORT=5480;DATABASE=***;UID=*****;PWD=*****;DSN=NZSQL")
print("Connection complete")
cursor=conn.cursor() 

cursor.execute("SELECT * FROM data_input limit 1000")
colnames=[desc[0] for desc in cursor.description]
df=pd.DataFrame.from_records(cursor.fetchall(),columns=colnames)
#alternative: fetch only a couple of rows instead of everything
#df=pd.DataFrame.from_records(cursor.fetchmany(size=2),columns=colnames) 
df.shape
df.describe()
df.columns

cursor.close()
conn.close()
#===============================================================
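
Alternatively, pandas can usually build the dataframe straight from the query through a pyodbc connection, which skips the manual column-name step. A minimal self-contained sketch, reusing the masked connection string from above:
#===============================================================
import pandas as pd
import pyodbc

conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
         "PORT=5480;DATABASE=***;UID=*****;PWD=*****;DSN=NZSQL")
# read_sql accepts a DBAPI connection; pandas may print a warning for non-sqlalchemy connections
df = pd.read_sql("SELECT * FROM data_input limit 1000", conn)
print(df.shape)
print(df.dtypes)
conn.close()
#===============================================================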

Question: How do we upload data from python into IBM Netezza data warehouse?
#===============================================================
import pandas as pd
import pyodbc 
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
         "PORT=5480;DATABASE=***;UID=*****;PWD=*****;DSN=NZSQL")         
cursor = conn.cursor()
# create table
cursor.execute("CREATE TABLE TEST("
    "C_ID  INT,"
    "C_FP  FLOAT,"
    "C_VARCHAR VARCHAR(100),"
    "C_DATE DATE, C_TIME TIME,"
    "C_TS TIMESTAMP,"
    "C_BOOL BOOL)")
cursor.execute("INSERT INTO test VALUES(1,1.1,'asdgaing',
      '1901-01-01','23:12:34','1901-01-01 09:00:09','t')")
cursor.execute("INSERT INTO test VALUES(2,3.4,'asdfg2345',
      '1991-11-11','00:00:01','1981-12-31 19:19:19','f')")
cursor.execute("SELECT * FROM TEST")
rows = cursor.fetchall()
for row in rows:
    print(row, end='\n')
cursor.execute("DROP TABLE TEST CASCADE")
cursor.close()
conn.close()
#===============================================================

Question: How do we upload data from python onto amazon redshift?
1st method: this is the relatively easier method; read the csv into Python, then upload it to Redshift using pandas_redshift.

#===============================================================
import pandas_redshift as pr
pr.connect_to_s3(aws_access_key_id = '***********',
              bucket ='bucket_name',
              aws_secret_access_key = '*********'
              )
pr.connect_to_redshift(dbname = "your_database_name",
               user = "your_user_name",
               password = "your_pw",
               port = "5439",
               host = "****.redshift.amazonaws.com")

data_type = Your_data_in_python.dtypes.replace(
    ['object',       'float64', 'int64', 'datetime64[ns]'],
    ['VARCHAR(255)', 'FLOAT4',  'INT',   'TIMESTAMP'])
pr.pandas_to_redshift(data_frame = Your_data_in_python,
    redshift_table_name = 'New_dataname_in_redshift', column_data_types = data_type) 
#===============================================================

2nd method: another way is to use the COPY command to upload a csv file to Redshift:

#===============================================================
--Step 1. create the schema via aginity workbench for redshift: 
drop table IF EXISTS public.data_name_in_redshift;
CREATE TABLE public.data_name_in_redshift
(	
	num_var1 NUMERIC(15, 5) ENCODE bytedict,
  num_var2 NUMERIC(15, 5) ENCODE bytedict,
	char_var1 VARCHAR(256) ENCODE lzo
	)
DISTSTYLE EVEN;

##Step 2. upload the csv file onto the S3 bucket, make sure: 
#a. no header allowed in the csv. 
#b. the order of variables matches with the schema set up. 
 
##Step 3. run the following in python: 
ut8 = run_query("""delete from data_name_in_redshift""")
ut9 = run_query("""copy data_name_in_redshift
from 's3://folder_location/csv_file_name.csv' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::1234567890:role/RedshifttoS3'
csv 
DELIMITER ',' 
NULL AS 'NULL' 
DATEFORMAT as 'YYYYMMDD';
""")
    
--if there are some errors, run the following (a Python version of this check follows this block): 
select * from stl_load_errors limit 1000;
#===============================================================
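
If the COPY fails, here is a minimal sketch to pull the most recent load errors into a dataframe (assuming the eng SQLAlchemy engine from the earlier Redshift snippets):
#===============================================================
import pandas as pd

# eng is assumed to be the redshift sqlalchemy engine created earlier
errors = pd.read_sql("""select starttime, filename, line_number,
                               colname, err_reason
                        from stl_load_errors
                        order by starttime desc limit 20""", eng)
print(errors)
#===============================================================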

Question: How do we use simple SQL on a python dataframe?
Answer: In Python you can use built-in functions like ".groupby" or ".pivot_table" for summary statistics. They are very flexible, but the syntax is not easy to remember, so you probably need to check example syntax every time you use them.

Most people are familiar with SQL, either from previous analytics/IT training or from prior SAS/R SQL experience, so it would be great to keep using similar SQL queries in Python. We can use pandasql (aliased below as pysql) for this purpose.
#===============================================================
pip install pandasql ##== run this in anaconda prompt to install==

from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())  
##==========more explanation for globals/locals later=========##

data1=pysql("""select var0,count(*) as cnt,sum(var1)/sum(var2) 
          as pcnt from data0 group by 1""")
          
##very tricky: if you use pysql inside a defined function, 
##then you need to define data0 as global inside the function.           
          
data2=pysql("""select a.*,b.pcnt from data0 as a 
          inner join data1 as b on a.var0=b.var0""")

##==replace function in sqlite ==##          
SELECT REPLACE('xx is very light, xx is easy', 'xx', 'SQLite')
          
SELECT Name, SUBSTR(Name,2,4), SUBSTR(Name,2) FROM Students;

##==select aut_name ending with string 'on'
SELECT aut_name FROM author WHERE LIKE('%on',aut_name)=1;

SELECT aut_name FROM author WHERE LIKE('L_n_on',aut_name)=1;
        
##==Use 3 double quotes for multiple lines, or 1 double quote for a single line.          
#===============================================================
Notice that pysql works on a SQLite database under the hood, so you can use many SQL features, including GROUP BY, JOIN, and HAVING. However, there are some limitations when you use pysql with SQLite: only the following core functions are available: avg, count, max, min, sum, abs, random (random() takes no argument), round(12.356,2), substr(string_var, starting_position, num_of_char), instr(orig_string_var, substr_to_find), upper, lower, replace, length, ltrim, rtrim, trim, coalesce. A short sketch using several of these functions follows the LIKE notes below.

WHERE LIKE('%on',aut_name)=1; ##return anything with on at the end of the string.
WHERE LIKE('L_n_on',home_city)=1; ##returns values like 'London'; each underscore matches exactly one character.
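
Here is a minimal sketch of those core functions in action through pandasql (the data0 dataframe and its columns are made up purely for illustration):
#===============================================================
import pandas as pd
from pandasql import sqldf

pysql = lambda q: sqldf(q, globals())

# toy dataframe, purely for illustration
data0 = pd.DataFrame({'var0': ['a', 'a', 'b'],
                      'var1': [1.234, 2.345, 3.456],
                      'name': ['  London ', 'Boston', 'Lincoln']})

out = pysql("""select var0,
                      count(*)            as cnt,
                      round(avg(var1), 2) as avg1,
                      upper(trim(name))   as name_up,
                      substr(trim(name), 1, 3) as name3
               from data0
               group by var0, name""")
print(out)
#===============================================================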

If you have to use some other functions, like exp() or ln(), then you can use the numpy package. If you are interested in using pivot_table or groupby functions for summaries, here are some examples.
#===============================================================
test3 = pd.merge(test1, test2, left_on=['var1','var2'], right_on=['var_a','var_b'],
left_index=False, right_index=False,how='left')

list1 = ['metric_var1','metric_var2']
summary1 = ds[(ds['var1']>=0) & (ds['var2']==0)]\
    .pivot_table(index='var0', values=list1, aggfunc='sum')

summary2 = ds[(ds['var1']>=date0) & (ds['var2']==0)]\
    .groupby('var0', as_index=False).agg({"var4": 'count'})
#note that groupby will not include any nan values, i.e. the groupby output does not display null groups, so we need to fillna first via:
df1 = df.fillna(0)   #note: df.fillna(0, inplace=True) returns None, so don't assign its result

seg1b=seg1a.groupby('mid', as_index=False).agg({'vin':'count','var2':'mean'})
seg1c=seg1b[(seg1b.vin>=5)] #the column names are mid, vin (not mid, count)
t2c=t2b.groupby(['sale_yr'],as_index=False).agg({"ratio_list": [("avg1",'mean'),("median1",'median')]})
t2c
#===============================================================

Syntax of if...elif...else:
#===============================================================
if test expression:
    Body of if
elif test expression:
    Body of elif
else:
    Body of else
#===============================================================
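
A minimal concrete example of the branching syntax (the threshold values are arbitrary):
#===============================================================
score = 72   # arbitrary value for illustration

if score >= 90:
    grade = "A"
elif score >= 70:
    grade = "B"
else:
    grade = "C"

print(grade)   # prints: B
#===============================================================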


Question: What are global/local variables/dataframes/environments in Python?
Week 2, example 1: a simple event-driven program (these notes use CodeSkulptor's simplegui module)
# Import CodeSkulptor GUI module
import simplegui

# Event handler
def tick():
    print "tick!"

# Register handler
timer = simplegui.create_timer(1000, tick)

# Start timer
timer.start()
#2: global vs local examples
# num1 is a global variable
num1 = 1
print(num1)

# num2 is a local variable
def fun():
    num1 = 2
    num2 = num1 + 1
    print(num2)
fun()

# the scope of global num1 is the whole program, num1 remains defined
print(num1)

# the scope of the variable num2 is fun(), num2 is now undefined (this raises a NameError)
print(num2)

# why use local variables?
# give a descriptive name to a quantity
# avoid computing something multiple times

def fahren_to_kelvin(fahren):
    celsius = 5.0 / 9 * (fahren - 32)
    zero_celsius_in_kelvin = 273.15
    return celsius + zero_celsius_in_kelvin
print(fahren_to_kelvin(212))

# the risk/reward of using global variables
# risk - consider the software system for an airliner
# critical piece - flight control system
# non-critical piece - in-flight entertainment system
# both systems might use a variable called "dial"
# we don't want possibility that change the volume on your audio
# causes the plane's flaps to change! 

# example
num = 4

def fun1():
    global num
    num = 5
    
def fun2():
    global num
    num = 6

# note that num changes after each call with no obvious explanation    
print(num)
fun1()
print(num)
fun2()
print(num)

# global variables are an easy way for event handlers
# to communicate game information.
# a safer method exists, but it requires more object-oriented programming techniques
# Local vs. Global Variables Example
# For this example, there are five versions of the same program.
# Three of them work and two of them don't. The goal of the program
# is to move an object along a number line using a move method,
# keep track of its location with the variable loc, and calculate
# the displacement (always positive) from its starting location
# with the displacement method.
# Version one: Doesn't work. The move method throws an error,
#	while the displacement method always prints 0 because
#	the location never changes.
start = 1
loc = start

def move(x):
    loc = loc + x

def displacement():
    return abs(loc - start)

#move(3)
print("Version 1:", displacement())
#move(-2)
print("Version 1:", displacement())
print()

# Version two: Works. Fixes version one by declaring variables
#	global at the start of the move method. No global
#	declaration is needed at the start of the displacement
#	method because the values in the global variables loc 
#	and start are not changing.

start = 1
loc = start

def move(x):
    global loc
    loc = loc + x

def displacement():
    return abs(loc - start)

move(3)
print "Version 2:", displacement()
move(-2)
print "Version 2:", displacement()
print

# Version three: Also works. This one returns values instead
#	of attempting to override the global variable. Notice 
#	that the new local variable must have a different name.
#	Notice also that we must assign loc to the value returned
#	by the move method.

start = 1
loc = start

def move(x):
    pos = loc + x
    return pos

def displacement():
    return abs(loc - start)

loc = move(3)
print "Version 3:", displacement()
loc = move(-2)
print "Version 3:", displacement()
print

# Version Four: This one does not work. The loc that is 
#	a parameter in the move method is actually a local 
#	variable instead of a global one, and therefore the 
#	value of the global loc does not change.

start = 1
loc = start

def move(x, loc):
    loc += x
    return loc

def displacement():
    return abs(loc - start)

move(3, loc)
print "Version 4:", displacement()
move(-2, loc)
print "Version 4:", displacement()
print

# Version Five: This one fixes the problem from version 
#	four. This one passes the method the value of loc as a 
#	parameter, and returns the value of the new loc. Note 
#	that in this example the local variable shares the same
#	name as the global one, but is not actually the same thing.

start = 1
loc = start

def move(x, loc):
    loc += x
    return loc

def displacement():
    return abs(loc - start)

loc = move(3, loc)
print "Version 5:", displacement()
loc = move(-2, loc)
print "Version 5:", displacement()
print

# CodeSkulptor runs Python programs in your browser.
# Click the upper left button to run this simple demo.
# CodeSkulptor runs in Chrome 18+, Firefox 11+, and Safari 6+.
# Some features may work in other browsers, but do not expect
# full functionality.  It does NOT run in Internet Explorer.


import simplegui

message = "Welcome!"

# Handler for mouse click
def click():
    global message
    message = "Good job!"

# Handler to draw on canvas
def draw(canvas):
    canvas.draw_text(message, [50,112], 36, "Red")

# Create a frame and assign callbacks to event handlers
frame = simplegui.create_frame("Home", 300, 200)
frame.add_button("Click me", click)
frame.set_draw_handler(draw)

# Start the frame animation
frame.start() 
