Saturday, December 7, 2019

Python study notes 1 - Installation basics, local vs global

How do we install packages in python?
How do we send emails in python?
How do we load/output CSV files in/to Python
How do we output/import pickle files in Python
How do we read/load R datasets in/to Python
How do we create empty dataframe and then fill values
How do we subset a dataframe in Python
How do we use simple SQL on a Python dataframe
How do we download data from netezza?
How do we download data from amazon redshift?
How do we upload data from python onto amazon redshift?
How do we include a separate piece of Python code to run?
How do we convert a list to dataframe ?
Local vs. Global Variables
How do we get outliers from python? 

How do we rank column by certain group?
Internal studentized residual VS external studentized residual
How do we scatter plot in python?
How do we drop big dataframe and release memory in python?
How do we break a list column to multiple bins, decile or quantile?
How do we run multiple python scripts in parallel in shell?
What is the code __name__ == "__main__" for

Question: How do we install Python the easiest way?
If you have any concerns or issues with Python, you can re-install it via Anaconda with the following steps:

1. Click the "Uninstall-Anaconda3" icon under C:\Users\*****\AppData\Local\Continuum\anaconda3 to uninstall the previous Anaconda. This might take 20-30 mins to finish; manually delete the anaconda3 folder after the uninstallation.

2. Go to the Anaconda download page to download the corresponding version; the installation might take just a few minutes.

3. Open Anaconda Prompt and run conda list anaconda to see which version of Anaconda is installed.
Update to the newest with conda update --all , then conda update anaconda . If you want to run Spyder, you need to go to Anaconda Navigator to launch Spyder first, then pin it to the taskbar.

4. Install Python packages via Anaconda; for example, run the following from the Anaconda Prompt window:
conda install -c anaconda boto3
conda install -c conda-forge s3fs
conda install -c conda-forge psycopg2
conda install -c conda-forge pandasql
conda install -c anaconda pandas-datareader
#don't run this: conda install -c conda-forge keras
#it will cause Anaconda Prompt to stop working; use pip to install keras instead:
pip install tensorflow
pip install keras
conda install -c conda-forge sqlalchemy 
conda install -c conda-forge sqlalchemy-redshift 
#this one will not work: conda install -c conda-forge pandas-datareader
#it will give you the error message: PackagesNotFoundError: the following packages are not available.

#install NLP-related packages, topic modeling etc. 
conda install -c anaconda gensim
conda install -c anaconda spacy
conda install -c anaconda pyLDAvis
conda install -c anaconda nltk
# Run in a python console to fetch the NLTK corpora:
import nltk  # opens the NLTK downloader
#in case loading en_core_web_sm from spacy gives you an error:
nlp = spacy.load('en_core_web_sm')
##error message: Can't find model 'en_core_web_sm' -- run the following from a terminal: 
python -m spacy download en
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
python -m spacy link en_core_web_sm en

#If it is still not working:
#then you can use the wheel link directly from GitHub (link lost in formatting):
pip install --no-deps

pip install stockstats
pip install mpl_finance

#This one might work: conda install -c primer stockstats
#it will not work with: conda install -c stockstats
#nor with: conda install -c conda-forge stockstats
#in case of the error: UnsatisfiableError (conda version conflict), 
#run this at the Anaconda Prompt instead: 
pip install stockstats 

#if the package is still not found after installation, you might want to check the directory
#especially on a different VM/worker node, you might need to log into the root account first via sudo su -:
step 1. gcloud compute ssh edc-cluster-w-2 --internal-ip --zone=us-east1-a
step 2. sudo su -
step 3. pip install pyarrow==0.14.0
step 4. exit
step 5. exit

Question: How do we standardize data in python?
from sklearn import preprocessing
standard_sc = preprocessing.StandardScaler()
atm3b = pd.DataFrame(standard_sc.fit_transform(atm3a))

#min_max_scaler = preprocessing.MinMaxScaler()
#X_train_minmax = min_max_scaler.fit_transform(X_train)
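StandardScaler just subtracts each column's mean and divides by its standard deviation, so the same result can be sketched in plain pandas. A minimal runnable example, with a small made-up dataframe standing in for atm3a:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for atm3a above
atm3a = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                      'x2': [10.0, 20.0, 30.0, 40.0]})

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0), column by column:
atm3b = (atm3a - atm3a.mean()) / atm3a.std(ddof=0)
```

After scaling, each column has mean 0 and standard deviation 1, which is exactly what fit_transform produces.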

Question: How do we include a separate piece of python code to run?
#example of putting some commonly re-used functions in a "package": 
#Treat it like a module: import the file without the .py extension.
#if your file is, you can just use:
from file1 import function1
#better than using: import file1, because then you have to call file1.function1(), repeating the filename every time.

#an example of running another piece of python code in a python session.
#note: put the file in the same directory as your main python file.
import os 
os.system('python')  #the original call was lost in formatting; os.system is one option, and the filename is an example
#be careful with this approach: functions defined inside the file will not be kept in your session.

#another way to import particular functions:
import importlib
comm_util = importlib.import_module('.s3_comm_util', package='dsv-aws-common')
##package here is the folder name
get_pickle_from_s3 = comm_util.get_pickle_from_s3
get_s3_csv_gz = comm_util.get_s3_csv_gz

# to include a separate python code to run, you may need to change folders first 
##===to see the current working directory===##
import os
cwd = os.getcwd()

# use '\\' when changing the directory, e.g.:
os.chdir('C:\\some\\folder')  #path is an example

#Then from that directory you can import any frequently-used function 
from comm_util import drop_table,  init_connection, put_pickle_to_s3
import comm_util
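The importlib.import_module pattern above points at a project-specific package (dsv-aws-common), so it only runs in that environment. A minimal runnable sketch of the same mechanics, using a stdlib module instead:

```python
import importlib

# Same mechanics as the comm_util example above, but with a module
# that always exists (math), so this sketch runs anywhere
math_mod = importlib.import_module('math')

# pull out individual functions, just like get_pickle_from_s3 above
sqrt = math_mod.sqrt
```

The function object pulled out this way behaves exactly like a normal `from math import sqrt`.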

Question: How do we send emails in python?
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

fromaddr = ""  ##the sender address was elided here; it can also be left empty
toaddr = ""    ##the recipient address was elided here
msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "Oh, Yeah! Python Task finished! "

body = "Python Task-1 finished"
msg.attach(MIMEText(body, 'plain'))

import smtplib
server = smtplib.SMTP('', 25)  #the SMTP host name was elided here
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
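The message-building half of this recipe can be tested offline; only the final sendmail call needs a real SMTP host. A self-contained sketch (addresses left blank as in the original, and the subject text is an example):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

fromaddr = ""   # fill in the sender address before actually sending
toaddr = ""     # fill in the recipient address

msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "Python Task finished!"
msg.attach(MIMEText("Python Task-1 finished", 'plain'))

text = msg.as_string()   # this string is exactly what server.sendmail() transmits
```

Inspecting msg.as_string() before sending is a handy way to confirm the headers and body look right.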

Question: How do we install packages in python?
In Python, packages such as Flask are available in a public repository, from which anybody can download and install them. The official Python package repository is called PyPI, which stands for the Python Package Index.
#use the following to get any help: 
!pip install --help  #running in notebook or spyder
#===there are several ways to install packages in python
#Method 1: using anaconda to install
#search Anaconda Prompt from "Start"
pip install -U pandasql
# if it shows a permission error, use this one: 
pip install pandasql --user    
# ==========================================================
#why we are using --user: because by default, pip defaults to installing Python packages 
#to a system directory (such as /usr/local/lib/python3.4). This requires root access.
#--user makes pip install packages in your home directory instead, 
#which doesn't require any special privileges.

pip install pandas_redshift   #pip stands for "Pip Installs Packages"
pip install spyder    #=====  install spyder
pip install --upgrade spyder  #==== upgrade spyder

#Method 2: if no Anaconda, then run the code directly in the Spyder IDE
# or run the following in a jupyter notebook 
import pip
from pip._internal import main #===needed for newer pip versions==
#==for older pip versions you may get: AttributeError:
#==module 'pip' has no attribute 'main' if you use pip.main()

Here is a tricky way to format all float numbers to a fixed number of decimals (4 here), instead of scientific notation:

import pandas as pd
pd.options.display.float_format = '{:.4f}'.format

#use "," as 1000 separators
df[c] = df[c].astype(int).apply(lambda x: f'{x:,}')
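Both display tricks above can be checked in a few lines; the column names here are made up for illustration:

```python
import pandas as pd

pd.options.display.float_format = '{:.4f}'.format
df = pd.DataFrame({'v': [0.123456789]})
shown = repr(df)            # the display option applies when printing the frame

big = pd.Series([1234567, 89])
with_commas = big.apply(lambda x: f'{x:,}')   # "," as thousands separator
```

Note the float_format option only changes how numbers are displayed; the underlying values keep full precision.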

Question: How do we run multiple python scripts in parallel in shell?
Run the following in the shell/command line; the trailing & sends each script to the background so they run in parallel (script names are examples):
python &
python &

Question: How do we save/export/load a python data frame to an R datatype?
###Export in Python===
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
###Import in R=========
library(feather)
path <- "my_data.feather"
my_data <- read_feather(path)
save(my_data, file="my_data.RData")
Question: How do we get the data size of all folders in a directory?
import os
import numpy as np
import pandas as pd
from os.path import join, getsize

from sys import getsizeof
print('the size of data in GB is ', getsizeof(p0)/(1024*1024*1024))  #p0 is any in-memory object

d = {'size_of_MB':[0], 
     'directory_file': ['*******************************************************']}
t00 = pd.DataFrame(data=d)

def get_size_folder():
    rows = []
    for root, dirs, files in os.walk('C:\\Users\\***\\Google_Drive'):
        size=round(sum(getsize(join(root, name))/(1024*1024) for name in files),0)  
        rows.append({'size_of_MB': size, 'directory_file': root})
    all0 = pd.DataFrame(rows)
    return all0


#use split to break the directory string into multiple substrings
new = a0["directory_file"].str.split("\\", n = -1, expand = True)
a1["d1"]= new[4] 
a1["d2"]= new[5] 
a1["d3"]= new[6] 
a1["d4"]= new[7] 
a1["d5"]= new[8] 
a1["d6"]= new[9] 
a1["d7"]= new[10] 

#check the size at the least granular folder level
a11=pysql("""select d1,sum(size_of_MB) as size_of_MB 
from a1 group by 1 order by 2 desc """)

a12=pysql("""select d1,d2,sum(size_of_MB) as size_of_MB 
from a1 group by 1,2 order by 3 desc """)

a13=pysql("""select d1,d2,d3,sum(size_of_MB) as size_of_MB 
from a1 group by 1,2,3 order by 4 desc """)
Question: How do we handle date, year, month in python? How do we get the corresponding weekday for a specified date in python?
##if date1 is already in datetime format
df1['year'] =df1['date1'].dt.year 
df1['month'] =df1['date1'].dt.month

weekno = date1.weekday()  #Monday=0 ... Sunday=6


##====it's trickier to get those values for a variable in a dataframe:
##==you need to go through the .dt accessor to avoid the Series error message.

##some other tricky tips to deal with datetime
from datetime import datetime    
row = "2011/1/1 0:30"
date1 = datetime.strptime(row, '%Y/%m/%d %H:%M')
#2011-01-01 00:30:00                                                                       
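Putting the strptime parse, the .dt accessor, and the weekday lookup together in one runnable sketch (the dataframe dates are examples):

```python
import pandas as pd
from datetime import datetime

# parse a single timestamp string
date1 = datetime.strptime("2011/1/1 0:30", '%Y/%m/%d %H:%M')
weekno = date1.weekday()     # Monday=0 ... Sunday=6

# for a dataframe column, go through the .dt accessor
df1 = pd.DataFrame({'date1': pd.to_datetime(['2011-01-01', '2019-12-07'])})
df1['year'] = df1['date1'].dt.year
df1['month'] = df1['date1'].dt.month
```

January 1, 2011 was a Saturday, so weekday() returns 5 here.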

Question: How do we do some quick summary statistics in python?
Answer: we can use the following functions to get quick summary statistics for a dataframe.
import sys
import pickle
import os
import _pickle as cPickle
import gzip
import datetime
from datetime import date
from datetime import timedelta
import pandas as pd
import numpy as np
import scipy as sp
#import gc
#from pandasql import sqldf
#pysql = lambda q: sqldf(q, globals())
pd.set_option('mode.chained_assignment', None)
pd.set_option('display.max_columns', 500)  
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.width', 125)

##==check the frequency of the variable==##
#'var1' below is a placeholder column name
a1b['var1'].value_counts()

#get the maximum value
a1b['var1'].max()
#get a sample of the dataframe
a1b[((a1b.bad_95=='1') & (a1b.src=='tax'))].sample(n=20,random_state=1)

print(" dataset has {} data points with {} variables each.".format(*data0.shape))
data0=shf[(shf['var1'].notnull()) | (shf['var2'].notnull())  ]
#use isnull() or notnull() to select

#generate a pie chart for the summary descriptive categories: 
type_counts = df['Type'].value_counts() 
df2 = pd.DataFrame({'house_type': type_counts}, 
                    index = ['t', 'h', 'u'])
df2.plot.pie(y='house_type', figsize=(10,10), autopct='%1.1f%%')

##==check the unique values of the variable==##
a1b['var1'].unique()
##==check the distribution of the variable==##
a1b['var1'].describe()
##==check the count of nulls of the variable==##
a1b['var1'].isnull().sum()
##example of rolling average, min, max, cumulative sum
df['L9'] = df['low'].rolling(window=9).min()
df['H9'] = df['high'].rolling(window=9).max()
df['RSV'] = 100*((df['close'] - df['L9']) 
               / (df['H9'] - df['L9']) )
df['k'] = df['RSV'].rolling(window=3).mean()
df['d'] = df['k'].rolling(window=3).mean()

#maximum and minimum of "X" and "Y" columns (for each row)
df[["X", "Y"]].max(axis=1)
df[["X", "Y"]].min(axis=1)

data1['max1'] = np.maximum.reduce(data1.iloc[:,[-1,-2,-3,-4]].values, axis=1) 
#max of 4 columns.
data1['max1'] = np.maximum.reduce(data1.iloc[:,list(range(9,140,2))].values, axis=1) 
#max of more columns, starting with 9, then every other column
data1a=data1.iloc[:,[0,1,2,3,4,5,-1,-2,-3,-4,-5]] #rearrange/reorder the columns
df['var_new']=df['var'].clip_upper(max1)  #deprecated; use df['var'].clip(upper=max1)
df['var_new']=df['var'].clip_lower(min1)  #deprecated; use df['var'].clip(lower=min1)
Notice that all those functions, value_counts(), unique(), isnull(), notnull(), sum(), are called with parentheses. In fact, we can define a function to do a few summary statistics for any dataset:
def check(data0):
    print(' Num of records: {}  and  Num of variables: {} '.format(*data0.shape))
    print('\n Num of Null records: \n', data0.isnull().sum())
    print('\n List of variables: \n', list(data0))
    print('\n Variable data types: \n ', list(data0.dtypes))    
    print('\n  The first 5 records: \n ', data0.head())
    print('\n  The last 5 records: \n ', data0.tail())
    print('\n  Variable distribution: \n ', data0.describe())
Python scopes code blocks by indentation, so if you have written some lines without an indent and then decide to indent them further -- for example, to wrap them inside a function definition -- you need to indent all those lines together. It could be hundreds of lines; indenting them line by line is definitely not a good choice. Instead, highlight/select the lines you want indented, then press TAB as often as needed until they reach the proper indent level. You can remove spaces with SHIFT+TAB.

You can also use CTRL+ALT+I to auto-indent the selection. If you're using IDLE (or a Python notebook .py file), you can use Ctrl+] to indent and Ctrl+[ to unindent. (On a MacBook, use Command+[ to unindent.)

Question: How do we get the SAS-retain similar statement in python?
Say there are a few missing values in the data; can we impute each one from the previous value for the same ID? Here is an example:
d = {'id':['AA','AA','AA','AA','BB','BB','BB','CC','CC'], 
     'value': [3, 6, 'NA',3,5,'NA',7,3,'NA']}
df = pd.DataFrame(data=d)
df['row'] = range(len(df))  #row-order column used by the queries below
##In reality, this "row" variable could be some other level##

from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())

m3a=pysql("""select * from df where value='NA' """)
m3b=pysql("""select * from df where value<>'NA' """)

m3c=pysql("""select, a.row as row0, b.row as row1,
a.row-b.row as diff from m3a as a left join m3b as b
 on = and a.row > b.row """)

m3d=pysql("""select, row0, min(diff) as diff 
from m3c as a group by 1,2 order by 1,2""") 

m3e=pysql("""select a.* from m3c a inner join m3d as b 
on = and a.row0=b.row0 and a.diff=b.diff""")  
m3f=pysql("""select, b.row0, a.value from m3b as a
inner join m3e as b on = and a.row=b.row1""") 

m3g=pysql("""select, a.row, a.value as value_original,
coalesce(b.value, a.value) as value from df as a left join
m3f as b on = and a.row=b.row0 order by, a.row""")
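The SQL chain above rebuilds the SAS retain behavior by hand. pandas can do the same carry-forward in one line with groupby().ffill(), provided the missing values are real NaN rather than the string 'NA'; a runnable sketch on the same example data:

```python
import numpy as np
import pandas as pd

d = {'id': ['AA','AA','AA','AA','BB','BB','BB','CC','CC'],
     'value': [3, 6, np.nan, 3, 5, np.nan, 7, 3, np.nan]}
df = pd.DataFrame(data=d)

# carry the previous non-null value forward within each id
df['value_filled'] = df.groupby('id')['value'].ffill()
```

Each NaN takes the most recent earlier value within its own id group, so groups never leak into each other.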

#to combine multiple columns in panda dataframe, you can use fillna('') to coalesce: 
df['c'] = df.a.combine_first(df.b)  # use a first;  elseif a=NaN, then use b.
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
#will have multiple values mixed

cols = ['keywords_0', 'keywords_1', 'keywords_2', 'keywords_3']
df["keywords_all"] = df[cols].apply(lambda x: ','.join(x.dropna()), axis=1) 
#join multiple keywords together with ',' separated. 

#join two string columns together:
df['c'] = df['a'].astype(str) + df['b'].astype(str)
Question: How do we create a variable for each iteration of a for loop?
for i in range(10):
	x(i) = 'abc'+str(i)  #not working!

so that we would get x0, x1, x2, x3, .... holding 'abc0', 'abc1', etc.
for i in range(10):
    globals() ["x"+str(i)] = 'abc'+str(i)
    locals() ["y"+str(i)] = 'abe'+str(i) 
#note: writing into locals() is not reliable inside a function body

#another way to do this (usually the better choice): 
x = []
for i in range(10):
    x.append('abc'+str(i))

>>> print(x[0])
>>> print(x[8])
Question: How do we load text/csv/excel data into/out of python?
Before we mention loading files, pay attention to code like:
file=r'C:\Directory\Name_of_the_file_import.csv' -- what's the r for? The r prefix makes a raw string, so the backslashes in the path are not treated as escape characters. (A different "r" shows up as the mode argument of open(); the modes are:)
Character Function
r Open file for reading only. Starts reading from beginning of file. This is the default mode.
rb Open a file for reading only in binary format. Starts reading from beginning of file.
r+ Open file for reading and writing. File pointer placed at beginning of the file.
w Open file for writing only. File pointer placed at beginning of the file. Overwrites existing file and creates a new one if it does not exist.
wb Same as w but opens in binary mode.
w+ Same as w but also allows reading from the file.
wb+ Same as wb but also allows reading from the file.
a Open a file for appending. Starts writing at the end of file. Creates a new file if file does not exist.
ab Same as a but in binary format. Creates a new file if file does not exist.
a+ Same as a but also open for reading.
ab+ Same as ab but also open for reading.
#read a csv file using pd.read_csv
# Make sure pandas is loaded
import pandas as pd
test = pd.read_csv(file)
pd.read_csv('data.csv', sep='|', index_col=False, nrows=1000,  
             names=['protocol', 'server', 'type', 'value'])
#sometimes you might see some error message: "unicodedecodeerror utf 8 
#codec can t decode byte ** in position ** invalid start byte pandas" 
#then we need to add the decode method: 
test = pd.read_csv(file,encoding = "ISO-8859-1")
test = pd.read_csv(file,encoding="utf-8")

#one more tricky thing about read_csv(): if some field has a value like "asddg asdfg
#with only one double quote (the 2nd double quote is missing),
#you have to use the option quoting=3, i.e. csv.QUOTE_NONE:
ca1 = pd.read_csv(file,sep='|',delimiter='|',quoting=3)
#if you don't, you will get fewer records than expected, because the parser
#keeps reading across rows looking for the closing 2nd double quote.

#convert True or False to 1 or 0:
df['flag'] = df['flag'].astype(int)  #'flag' is a placeholder column name
#in case if you got the error message: DtypeWarning: Columns (3,5) have mixed types. 
#Specify dtype option on import or set low_memory=False.
#use the following dtype option to be object: 
df = pd.read_csv("gs://folder/data1.csv",sep='|',delimiter='|',dtype=object)
df = pd.read_csv("gs://folder/data1.csv",compression='gzip',sep='|',delimiter='|',dtype=object)

#when use pd.read_csv, there is option: na_values for null. #
#By default the following values are interpreted as NaN: 
#'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
# '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 
# 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
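The default na_values behavior is easy to verify without touching disk, by reading from an in-memory string (the column names and values here are made up):

```python
import io
import pandas as pd

# an in-memory "file" stands in for a csv on disk
csv_text = "a|b\n1|NA\n2|hello\n"
test = pd.read_csv(io.StringIO(csv_text), sep='|')

# 'NA' is in the default na_values list, so it arrives as NaN
n_null = int(test['b'].isnull().sum())
```

Pass na_values=[...] and keep_default_na=False to read_csv if you need 'NA' kept as a literal string instead.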

 #save a csv file using pd.to_csv
test.to_csv('output.csv', index=False)
#index=False is to not output the row index as an extra column 

#read a text file into python, line by line
f = open("demofile.txt", "r")
for x in f:
    print(x)

#another method to read txt: the whole file at once
f = open("demofile.txt", "r")

#read an excel file into python
import pandas as pd
test = pd.read_excel('C:/Directory/file_name.xlsx')  #path is an example
##use forward slashes in the path; don't use the other ones: \ or \\ ##

 #save results to excel sheets
writer = pd.ExcelWriter(file, engine='xlsxwriter')

result[0].to_excel(writer, sheet_name='Sheet1_name')
workbook =

worksheet = writer.sheets['Sheet1_name']
percent_fmt = workbook.add_format({'num_format': '0.000%', 'bold': False})
worksheet.set_column('A:G', 20)
worksheet.set_column('C:G', 20, percent_fmt)  #remember to finish

Question: How do we output/import pickle files in python?
Answer: outputting a csv file is the most common way to save your output, but in python you can also save output using a module called pickle. Pickle is used for serializing and de-serializing a Python object structure.

Why do we need serialization? Serialization allows you to read/write arbitrarily complicated objects, automatically or manually, from/to disk or the network. If you use csv or any other delimited format to transfer the file, you will always hit unexpected troubles. Almost any object in Python can be pickled so that it can be saved on disk.

What pickle does is "serialize" the object before writing it to file. Pickling is a way to convert a python object (list, dict, etc.) into a character stream. The idea is that this character stream contains all the information necessary to reconstruct the object in another python script. The disadvantage of the pickle format is that only python can read and write it, and it can have compatibility issues between different python versions.
# take user input for the amount of data
number_of_data = int(input('Enter the number of data : '))
data = []

# take input of the data
for i in range(number_of_data):
    raw = input('Enter data '+str(i)+' : ')

# open a file where you want to store the data
file = open('data_name', 'wb')

cwd = os.getcwd()
print(cwd) #this will be the location of the output pklz file

#recommended: dump the pickle via gzip this way: 
import gzip, pickle
with'test.pklz', 'wb') as ofp:
    pickle.dump([1,2,3], ofp)

with'test.pklz', 'rb') as ifp:
    data = pickle.load(ifp)
# dump information to that file
file = open('data_name', 'wb')
pickle.dump(data, file)
# remember to close the file, 
#otherwise you might get an error later when trying to read it:
#EOFError: Ran out of input, if you forget to close.

#in case you get the error message: 
#OverflowError: cannot serialize a bytes object larger than 4 GiB
#then you can try protocol 4 (note the file must be opened in binary 'wb' mode): 
pickle.dump(d, open("file", 'wb'), protocol=4)

cwd = os.getcwd()
cwd1 = os.path.join(cwd,'data_break/')
for cntycd1 in list_cntycd2:
    import datetime
    time_start =
    file1 = os.path.join(cwd1, str(cntycd1))  #one output file per county code
    file = open(file1, 'wb')
    cPickle.dump(temp0, file, protocol=4)
    time_end =
    time_mins = (time_end - time_start).total_seconds()
    print("seconds taking: ", time_mins, " sec")
    print("Minutes taking: ", round(time_mins/60,1), " mins")

#or use the following:
import gzip   #use gzip to compress the file.
with'model.pklz', 'wb') as f_model_pkl:  #filename is an example
    pickle.dump(test2, f_model_pkl)

import _pickle as cPickle #cPickle is much faster
file1 = cPickle.load('model.pklz', 'rb'), encoding="ASCII")     
# =============================================================================
# #Python Pickle load
# =============================================================================
# open a file, where you stored the pickled data
file = open('important', 'rb') #opening the file in read-binary (rb) 
data = pickle.load(file)
file.close()  # close the file
The main difference between cPickle and pickle is performance. The cPickle module is many times faster to execute because it's written in C and its methods are functions instead of classes. (In Python 3 the C implementation lives in the _pickle module, which the plain pickle module already uses under the hood.)

Sometimes you might get an error message like: _pickle.UnpicklingError: invalid load key. One common cause is that the pickle archive was created using gzip. For example, if a compressed pickle archive is made like this,
import gzip, pickle
with'test.pklz', 'wb') as ofp:
    pickle.dump([1,2,3], ofp)

with open('test.pklz', 'rb') as ifp:
    data = pickle.load(ifp)
trying to open it with plain open() throws: _pickle.UnpicklingError: invalid load key. But if the pickle file is opened using gzip, that solves the problem:
with'test.pklz', 'rb') as ifp:
    data = pickle.load(ifp)

Question: How do we save a pandas dataframe to parquet?
Parquet is a better choice than pickle for storing a pandas dataframe: it seems to throw fewer errors, unlike pickle.dump or pickle.load, which easily run into error messages. We can use fastparquet to save a python dataframe as parquet. For instance, if we have one pandas dataframe df1 and need to save it to a gs bucket in the most efficient way:
df1.to_parquet('US_163_data_23M.parquet', engine='fastparquet',compression='GZIP')

#read the parquet file back:
df1 = pd.read_parquet('US_163_data_23M.parquet')
#if you get an error message about missing parquet support, install fastparquet from the terminal:
conda install -c conda-forge fastparquet
#not recommended:
pip install -U fastparquet
#you might get the error: unable to execute 'gcc': no such file or directory
#run the conda installation above to get around the error. 

In python:     
from fastparquet import write 
write('outfile_test.parq', df1)

Saving in the parquet format can save a lot of space. We tested saving the same dataset: pickle took up 30+ GB of space, while the GZIP-compressed parquet format takes less than 1 GB. Highly recommend this approach.
Question: How do we create an empty dataframe and then fill values?
Sometimes we want to create an empty dataframe and then assign values/rows/data into it, like the following code, but the output is strange:
df1=pd.DataFrame()
print(df1)  #the output is nothing, empty!
You need to have at least one actual row before you start assigning more values/rows into that dataframe. Here is one way you can do it:
df1=pd.DataFrame({'var1':['something1']})  #need to have at least one row!
#all0=pd.DataFrame(index=range(1))[0:0] is not good.

# {} is for the key-value pair, [] for the list of values.
#or df1=pd.DataFrame({'var1':['something1'],'varz':['so******']})
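A common alternative to filling an empty dataframe row by row is to collect the rows in a plain list and build the frame once at the end; a minimal sketch with made-up column names:

```python
import pandas as pd

rows = []
for i in range(3):
    rows.append({'var1': 'something' + str(i), 'score': i * 10})

df1 = pd.DataFrame(rows)   # one construction call at the end
```

This avoids the empty-frame pitfall entirely and is also faster, since the frame is allocated once instead of being rebuilt on every append.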

Question: How do we convert a list to string in python?

# Function to convert   
def listToString(s):  
    # initialize an empty string 
    str1 = " " 
    # return string   
    return (str1.join(s)) 
# Driver code     
s = ['Geeks', 'for', 'Geeks'] 
print(listToString(s))

#output will be: 
Geeks for Geeks

#or use the combined code: 
output=" ".join(s)

Question: How do we subset dataframe in python?
#drop a list of columns from a pyspark dataframe
config['model'] = ['zip5_street1','zip5_street','zip7']
#not working: df0=final_df.drop(config['model'])
#the following is not working either: df0=final_df.drop([config['model']])
#this is working: 
df0=final_df.drop(*config['model'])
#the trick is to use * to unpack the list into separate arguments
#(relatedly, to print a list without square brackets: str(config['model'])[1:-1])

# Method best: select via column number========
#very convenient to use iloc to subset data: ====
#select the ith column (Python uses 0-based indexing):
all0.iloc[:,i-1]

# select rows of 0 and 10 (not from 0 to 10)
test.loc[[0, 10], :]

#select 4 columns in a different order
test[['var3','var1','var4','var2']]  #column names are examples
#To select rows whose column value is in list 
years = [1952, 2007]
df1['bad_1']=np.where((df1.year.isin([1952, 2007])) & (df1.score>=0.95), 1,0)

#select unique values
df1.year.unique()
array([1952, 2007])

#select the ones not from the list of values
continents = ['Asia','Africa', 'Americas', 'Europe']
gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)]
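The isin/~isin pattern can be checked on a toy frame standing in for the gapminder dataset:

```python
import pandas as pd

# toy data standing in for the gapminder dataset
gapminder = pd.DataFrame({'continent': ['Asia', 'Europe', 'Oceania', 'Africa'],
                          'year': [2007, 2007, 1952, 2002]})

continents = ['Asia', 'Africa', 'Americas', 'Europe']
# ~ negates the boolean mask: keep rows whose continent is NOT in the list
gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)]
```

The ~ operator flips every element of the boolean Series, so the filter keeps exactly the rows isin() would exclude.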
gapminder[~gapminder.continent.isin(continents) & (gapminder.year > 2000)]  #combined with a second filter (example condition)

#drop the columns B and C; the two calls below are equivalent
df.drop(['B', 'C'], axis=1)
df.drop(columns=['B', 'C'])
#axis=1 for columns, axis=0 for rows

#rename one column name
all2.rename(columns={ "off-white": "off_white"}, inplace=True)

#change data types, using astype() or to_numeric()
#in case if some error: "could not convert string to float", use  to_numeric()
df['b'] = pd.to_numeric(df['b'], errors='coerce')

#in case if some error: "could not convert series to float()", use numpy, instead of math
t1['pcnt_gain_year']=np.power(t1['pcnt_gain'],1/5)-1  #this will work fine. 
t1['pcnt_gain_year']=math.pow(t1['pcnt_gain'],1/5)-1  #error will show up.

#in case you want to convert to integer: 
df['b'] = pd.to_numeric(df['b'], errors='coerce', downcast='integer')
#downcast : {'integer', 'signed', 'unsigned', 'float'}, default None

#errors='coerce' turns unparseable values into NaN:
all1['from2b'] = pd.to_numeric(all1['from2b'], errors='coerce')
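A quick check of what errors='coerce' actually does, on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['1', '2', 'oops'])
# errors='coerce' turns unparseable strings into NaN instead of raising
s_num = pd.to_numeric(s, errors='coerce')
```

Without coerce, the 'oops' value would raise a ValueError; with it, you get a numeric column you can clean up afterwards with isnull()/fillna().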

#rename all column names
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(index=str, columns={"A": "a", "B": "c"})

# select rows 0, 1, 2 (row 3 is not selected)
test[0:3]
# Method 1: select a 'subset' of the data using the column name
test[['var1', 'var2']]

# Method 2: use the column name as an 'attribute'

# select the first 5 rows (rows 0, 1, 2, 3, 4)

# select the last element in a list
# (the slice starts at the last element,
# and ends at the end of the list)

#create dummies via a where statement
df['dummy'] = np.where(df['var'] > 0, 1, 0)  #condition is an example


#stack 2 data frame together
pd.concat([df1, df2])
pd.concat([df1, df2]).drop_duplicates()

#remove duplicated records from test2 
test2.drop_duplicates(subset=['var1'],   #'var1' is a placeholder column name
keep='first', inplace=True)

#subset a dataframe via multiple filters
test1=test[(test['saledate']>holdout)   #first condition is an example
         & (test['saledate']<=holdout + timedelta(days=7*i))]

##define columns name for a data frame
df.columns = ['a', 'b']

##create an empty dataframe, then add rows using append (note: append was removed in pandas 2.0; use pd.concat)## 
ref_data=ref_data.append({'Coefficient':0.0, 'Variable': 
"C(var1, Treatment(reference=ref_var))[T."+str(ref_var)+"]"}, 
ignore_index=True)
f=pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)  

#join two lists of strings together:
mergedlist = list1 + list2

#remove something from a list of items
#in case you get an "unsupported operand" error between list and list, use set(): 
remaining = list(set(list1) - set(list2))

#or build the merged list incrementally:
mergedlist = []
mergedlist.extend(list1)
mergedlist.extend(list2)

df1=df.sample(n=3, random_state=12345) #sample of 3 records
df2=df.sample(frac=0.5, replace=True, random_state=1) #sample of 50% of records

#stratified sample, i.e. sample 2 rows by each group column, say 'year'
d1f=d1e.groupby('year', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

Using a DataFrame column as weights: rows with a larger value in the num_specimen_seen column are more likely to be sampled.
df3=df.sample(n=2, weights='num_specimen_seen', random_state=12345)
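Both sampling tricks run end to end on a small made-up frame (the year and num_specimen_seen columns are examples):

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2000, 2000, 2001, 2001, 2001],
                   'num_specimen_seen': [1, 2, 3, 4, 5, 6]})

# weighted sample: rows with larger num_specimen_seen are more likely drawn
df3 = df.sample(n=2, weights='num_specimen_seen', random_state=12345)

# stratified sample: 2 rows per 'year' group
d1f = df.groupby('year', group_keys=False).apply(
    lambda x: x.sample(min(len(x), 2), random_state=1))
```

min(len(x), 2) protects groups that have fewer rows than the requested sample size.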

Some examples of handling string operations:
####====Find the substring from a string====####
#regex=True assumes the pat is a regular expression;
#regex=False treats the pat as a literal string.
##Most of the time, just use False when we care about a literal string.##

#find the position of some (special) character, e.g. the '@' in an email column:
all1['from2a'] = all1['from1'].str.find('@')  #example; the positions feed the substr below
t1c=pysql(""" select *,substr(from1,from2a,from2b-from2a) as from2 from all1 as a """)
#to get the email address.

#get a substring (first 10 characters) from a column in a dataframe
df['New_var'] = df['var'].str.slice(0,10)

s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.NaN])
s1.str.contains('oG', case=True, regex=False)
0    False
1    False
2    False
3    False
4      NaN
#Returning any digit using regular expression.
s1.str.contains('\d', regex=True)

#'.0' in regex mode matches any single character followed by a 0
s2 = pd.Series(['40','40.0','41','41.0','35'])
s2.str.contains('.0', regex=True)
0     True
1     True
2    False
3     True
4    False

##== we can also use extract ==## 
s1 = pd.Series(['a1', 'b2', 'c3'])
s1.str.extract(r'([ab])(\d)')  #pulls the letter and digit into separate columns

s2 = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
s2.str.extract(r'([ab])(\d)')
#use += to join 2 strings together:
str1 += str2

#CONCAT(string1,string2) works only in SQL, not in python##
CONCAT(CONCAT(string1,string2),string3) as string_new
CONCAT(string1,nvl(string2, 0)) as string_new

#substring(str1,start_posi(1,2,...),#_of_chars) as str_new in SQL, not in python
substring(string1,5,3) as string_new

x = "Hello World!"
x[2:]     # 'llo World!'
x[:-2]    # 'Hello Worl'
x[2:-2]   # 'llo Worl'
#get the reverse/backward of the string: 
x[::-1]   # '!dlroW olleH'
#select every other letter
"H-e-l-l-o- -W-o-r-l-d"[::2] # outputs "Hello World"

##use the .join function to join a few strings##
music = ["Metallica", "Rolling Stones", "ACDC"]
" ".join(music)   # 'Metallica Rolling Stones ACDC'

test='asdgasdg H asdfghHHHHLLL H '
test   #test itself will not change
test1=test.replace(" H "," K ")     
#replace a string in a dataframe column
d1['unique_name']=d1['unique_name'].str.replace('_',' ')    
#comparing the three methods by speed: 
#str.replace > re.sub > str.translate
#translate is slower than re.sub (with precompilation) by about 10 times,
#and str.replace is faster than re.sub by about 3 times.
#using the re package to remove/replace punctuation  
import re
def remove_punct(text):
    new_words = []
    for word in text:
        w = re.sub(r'[^\w\s]','',word) #remove everything except word chars and space 
        w = re.sub(r'\_','',w)  #remove underscores as well
        new_words.append(w)
    return new_words

#another way: filter words against a punctuation list
with open('/home/rahul/align.txt','r') as f:
    f1 =
    all_words = f1.split()
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~''' 
    # You can add and remove punctuations as per your choice 
    # To remove from a single string instead, work on new_str, e.g.: 
    # new_str = "My name$#@ is . rahul -~"
    new_data = []
    for word in all_words: 
        if word not in punctuations:

    print (new_data)
#use the following to find out details about special characters      
#remove all the emails address
test1='sadhiong  sdhi  asdg'
test2=re.sub('\S*@\S*\s?', '', test1) 

#here is some good way to replace substring from a list of strings:
words = [w.replace('[br]', '\n') for w in words]

# Convert to list
data = df.content.values.tolist()
# Remove Emails
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
#\s is for any Unicode whitespace characters, which includes [ \t\n\r\f\v].
#\S is the opposite to \s: for any Non-(Unicode whitespace characters)
#\w: any Unicode word characters: includes most characters that can be part of a word 
#in any language, as well as numbers and the underscore.
#\W: the opposite of \w
# Remove new line characters
data = [re.sub('\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("\'", "", sent) for sent in data]
Question: How do we apply map lambda functions in python? Examples of doing loops in python?
xs = [(x+1) * 0.01 for x in range(9)]

[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]

##first load data from AWS S3 bucket
return_df = pd.read_csv('s3://{bucket}/{key}'.format(bucket='bucketname', 
key='the file name'), 
compression='infer', header=0, sep=',', quotechar='"', dtype=dtype)

dfb=dfa.set_index(keys='cntycd',drop=False,append = False)

#set the index of dataframe to be some column 
#drop is to drop that column if True, inplace is to replace the data

#try to make a new dataframe to keep the new structure, better than just using df.reset_index(drop=True);
#this is highly recommended before doing any data transformation/preprocessing.
#don't recommend using inplace=True; it might not be what you want, just make a new dataframe.

#drop=True to avoid the previous index to be added as column 
Series.reset_index(level=None, drop=False, name=None)
#reset index for the series, return dataframe if drop=False
return_df = return_df.sort_values('var1')
return_df = return_df.sort_values(by=['var1', 'var2'],ascending=[True,True])

data1=data.drop_duplicates(subset ="First Name",  #by default, use all the columns
                     keep = 'first')  #keep can be 'first', 'last', or False
#'first': drop duplicates except for the first occurrence; False: drop all duplicates.

return_df = return_df.set_index(['var1', 'var2'])
c = mileage_coef_lookup.columns.tolist()
##apply: map(function, iterable): the iterable will be the list c.
return_df.columns = list(map(lambda x: x + '_function', c))
#the output looks like: 'area_function','color_function' ##

#map(fun, iter): map() function returns a list of the results after applying 
#the given function to each item of a given iterable (list, tuple etc.)
def addition(n): 
    return n + n 
# We double all numbers using map() 
numbers = (1, 2, 3, 4) 
result = map(addition, numbers) 
#output: list(result) gives [2, 4, 6, 8]; map() returns an iterator in Python 3

numbers = (1, 2, 3, 4) 
result = map(lambda x: x + x, numbers) 

numbers1 = [1, 2, 3] 
numbers2 = [4, 5, 6] 
result = map(lambda x, y: x + y, numbers1, numbers2) 

map creates a new sequence by applying a function to every element of the source (in Python 3, wrap the map object in list() to get a list):
xs = [1, 2, 3]

# all of those are equivalent: the output is [2, 4, 6]
# 1. map
ys = map(lambda x: x * 2, xs)
# 2. list comprehension
ys = [x * 2 for x in xs]
# 3. explicit loop
ys = []
for x in xs:
    ys.append(x * 2)
Here is a loop over (0,i-1) and (i+1,n-1), joined together as one iterable via itertools.chain:
import itertools
for i in range(len(list1)-1):
    for j in itertools.chain(range(0,i), range(i+1,len(list1))):
        print('running for ', list1[i], '  --> ', list1[j])
list(range(2,8,3)) #output [2, 5], not include the ending number 8 

Question: How do we reshape/transpose/pivot data in python?
#transform data from long (vertical) to wide (horizontal)
#'id_var' below is a placeholder for your own index column
data2 = data1.pivot_table(index='id_var', columns='var_name', values='var_value')
#values is not strictly necessary.
#after you apply it, data2 will have a multiindex; reset_index afterwards.
##we can add aggfunc (mean/sum); the default is numpy.mean.

#in case you get the error: DataError: No numeric types to aggregate, 
#use aggfunc='first'

#or just use pivot for a single index#
data2 = data1.pivot(index='id_var', columns='var_name', values='var_value')

#or use groupby and sum/mean functions#
#'group_var' below is a placeholder for your own grouping column
df1 = df.groupby('group_var')\
    .agg({"high_open": [("count",'count'),("avg1",'mean'),("med1",'median'),("prod_accu",'prod')]})
#reset/flatten the multiindex column names:
df1.columns = ['_'.join(col) for col in df1.columns.values]
df1.columns = df1.columns + '_year5'

for pyspark column rename: 
df1.schema.names = ['_'.join(col) for col in df1.schema.names]
#don't use df1.columns = df1.columns + '_year5', 
#you will get error message: TypeError: can only concatenate list (not "str") to list 

#use for to apply the string concatenated to the list: 
config['score_input_cols'] = ['score_'+col for col in config['input_cols']]

#print the columns name without quotes:
print(*df1.columns,sep=',')  #use * to unpack the list of items. 
print(*['jdoe is', 42, 'years old'])
>>>jdoe is 42 years old

df2=df.groupby('weekday').mean() #or this following
df2=df.groupby(['symbol', 'weekday']).mean()
data2=data1.groupby(level=1, axis=1).mean()

sh1a=sh.groupby([var2,var3],as_index=False).agg({"clip": 'count'})
To get the data back from wide (horizontal) to long (vertical), we can use the melt function:
 new_data=pd.melt(df1, id_vars='var_as_by_category', 
 value_vars=[2010, 2011, 2012]) ##select the ones need to convert
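Putting pivot_table and melt together, here is a small self-contained round trip (the column names are made up for illustration):

```python
import pandas as pd

# hypothetical long-format data: one row per (id, var_name)
long_df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "var_name": ["height", "weight", "height", "weight"],
    "var_value": [170, 65, 180, 80],
})

# long -> wide: one column per var_name
wide = long_df.pivot_table(index="id", columns="var_name",
                           values="var_value").reset_index()
wide.columns.name = None  # drop the leftover columns-axis name

# wide -> long again via melt
back = pd.melt(wide, id_vars="id", value_vars=["height", "weight"],
               var_name="var_name", value_name="var_value")
```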
Question: How do we drop a big dataframe and release memory in python?
Answer: After we have run a series of programs, there may be quite a few dataframes/series left in memory, and some of them can be huge. Knowing how to drop those dataframes and release the memory back to the system, so we can recycle the limited memory, is a tricky but important technique when dealing with large datasets.
##check the size of the data frame in terms of MB##
import sys
##output in terms of Byte, change to MB by ##
print('The size of the file is ',
round(sys.getsizeof(data1)/(1024.0*1024.0),1), ' MB')

##you can also use getsizeof to get the memory usage of an array
from sys import getsizeof
print('The size of the array is ',
round(getsizeof(array1)/(1024.0*1024.0),1), ' MB')
#not working for

#code to list all dataframes in memory
%whos DataFrame'deep')

##2nd way of analyzing dataframe memory usage
df.memory_usage(deep=True).sum()
##the sum should be roughly equal to sys.getsizeof()##

##Technically memory is about this ##
df.values.nbytes + df.index.nbytes + df.columns.nbytes

#here is how to delete datasets and release memory
import gc  ##for garbage collection##
##if you just run "del df1" it might not release the memory##
##put the dataframes in a list; deleting the list and collecting will release it##
lst1 = [df1, df2]
del lst1
gc.collect()

Question: How do we save the log to some file in python?
If the job is submitted via the command line, for instance, submitting a python script:
python > file1.log
Sometimes you might want to capture stderr as well:
python &> file1.log
You can use the following logging via FileHandler if you want to capture logs from inside the code:
import os
cwd = os.getcwd()
log_dir = os.path.join(cwd,'log_1/')
cwd1 = os.path.join(cwd,'datain/')
cwd2 = os.path.join(cwd,'dataout/')

if not os.path.exists(log_dir):
    os.makedirs(log_dir)
import logging
import logging.handlers

import datetime
log = logging.getLogger()
log.setLevel(logging.INFO)

filename1 = log_dir + '/model_log_' + str( + '.txt'
fh = logging.FileHandler(filename=filename1)
fh.setFormatter(logging.Formatter(
                fmt='%(asctime)s %(levelname)s: %(message)s',
                datefmt='%Y-%m-%d %H:%M:%S'))
log.addHandler(fh)
...'this function is doing something')'this function is finished')
del log, fh

Question: How do we load R datasets/objects into Python?

#===You need to install and use the rpy2 package===============
#===rpy2 is an interface to R running embedded in Python.===
#===use the following code to install via anaconda prompt===
conda install -c r rpy2 
#===pip install rpy2 might not be able to install sometimes===
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
readRDS = robjects.r['readRDS']
df = readRDS('my_file.rds')
df = pandas2ri.ri2py(df)

Question: What happens if you see the error message "ValueError: x and y must be the same size", or "ValueError: x and y must have same first dimension", when you try to plot in python?

Answer: One of the challenges for Python users is that there are so many data formats: array, list, series, data frame, etc. We need to take extra caution with this.

When we try to plot or scatter plot:
import matplotlib.pyplot as plt
plt.scatter(data1.var_x, data1.var_y, color='green')

One direct check for "ValueError: x and y must be the same size" is simply to print the dimensions of x and y:
print(data1.var_x.shape, data1.var_y.shape)
to see if they are the same.

One funny but painful experience of mine is the following: I used pysql to join 2 data frames, where some of the variables shared the same name in both data frames, and I used a select a.*, b.* statement to get all the variables.

For example, if var_y shows up in both data frame 1 and data frame 2, then there will be two columns in the new data frame with the same variable name var_y. Now if you try to plot, most likely you will get the error above, because you have 2 columns for var_y while you only have one column for var_x. To take care of the issue, the best practice is to rename them to different variable names.

Similarly, for the error message: valueerror: x and y must have same first dimension, we can use len(x), len(y) to check the dimension.

Question: Say we have a list of dataframes, i.e. each element in the list is a dataframe with the same variable names. How do we combine them into one dataframe?
global New_dataframe
New_dataframe = pd.DataFrame()
for i in range(len(Old_list)):
    New_dataframe = pd.concat([New_dataframe, Old_list[i]], ignore_index=True)
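pandas can also do this in one call: pd.concat accepts the whole list at once, which avoids growing the dataframe inside the loop. A minimal sketch with made-up frames:

```python
import pandas as pd

# a hypothetical list of dataframes sharing the same columns
Old_list = [pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
            pd.DataFrame({"a": [5], "b": [6]})]

# stack them vertically; ignore_index rebuilds the 0..n-1 index
New_dataframe = pd.concat(Old_list, ignore_index=True)
```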
Question: How do we download data from amazon redshift?
Answer: we use the package sqlalchemy to download data from redshift. SQLAlchemy is a SQL toolkit released under the MIT license.

First, you need to install something very special: sqlalchemy-redshift (not pandas_redshift) via anaconda prompt: pip install sqlalchemy-redshift. Otherwise, you might get an error message like the following: "NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:redshift.psycopg2".

After you install that package, restart your python kernel/spyder program and the error message should go away. In case you are using a jupyter notebook, use this code to install: !pip install sqlalchemy-redshift
import sqlalchemy as sa
import pandas as pd
#import sqlalchemy_redshift  
#import psycopg2

#fill in your own user/password/host/database in the connection string
engine_setup = sa.create_engine('redshift+psycopg2://user:password@host:5439/dbname',
    connect_args={"keepalives": 1, "keepalives_idle": 60, "keepalives_interval": 60})
sql_query = """select var1,var2 from data0 where var_filter = '****' """
data_in=pd.read_sql_query(sql_query, con=engine_setup)
Here is an example of creating a table inside redshift, distributing data evenly.
#fill in your own user/password/host/database in the connection string
eng = sa.create_engine('redshift+psycopg2://user:password@host:5439/dbname',
# execution_options={'autocommit': True},
  connect_args={"keepalives": 1, "keepalives_idle": 60,
   "keepalives_interval": 60}
#   isolation_level="AUTOCOMMIT"
  )
eng.execute('DROP TABLE IF EXISTS dataset1')                         
eng.execute("""create table dataset1 as
    select var1,
    dateadd(day,-365,current_date) as ref_date,
    count(id) over (partition by id 
    order by sale_date desc rows unbounded preceding) as rank
    --create a rank variable for each tran with the same id
    from dataset0
    where var_filter= {var_used}
      """.format(var_used=****) ) 
If you have created a table/dataset inside redshift, some other people might have permission issues accessing it; here is the code you can use to grant access:
##grant access to all tables from schema "public" to someone##
##it might stop running due to one error message!## 
grant select on all tables in schema public to some_other_id

##you can use the following to create the grant list##
##this is extremely useful in permission issue###
select 'grant select on table projectz.public.'||tablename
||'  to skasdf;' from pg_tables where tableowner='kasdli';

##to solve the permission issue from the root: 
alter default privileges for user sk****, principal
    in schema public 
    grant all privileges on tables 
    to group user_group_name1;

##grant access to some table in some database to  someone##
select current_user;
grant all on all tables in schema public to PUBLIC
grant select on table database.public.some_other_tabe  to some_other_id

##in aginity redshift if you want to see the table list:
select *  from pg_tables
select *  from PG_TABLE_DEF

##in aginity redshift if you want to see the column list:
select * from pg_table_def
where table_name='uasdfv_uadt_ddsat9';

##or use this one:
select * from information_schema.columns 
where table_name='uasdfv_ddsat9';

##use the following code to get the size of each dataset##
SELECT name AS table_name, 
ROUND((COUNT(*) / 1024.0),2) as "Size in GB"
FROM stv_blocklist
JOIN (SELECT DISTINCT id, name FROM stv_tbl_perm) names
ON = stv_blocklist.tbl
GROUP BY name;

##in aginity netezza if you want to see the table list:
select * from _v_table_storage_stat

#rename the table: 
alter table old_name
rename to new_dataset;

#drop a column: 
alter table data1
drop column var1;

#rename a column: 
alter table data1
rename column old_name to new_name;

#change data type
cast(old_column as real) as new_name

####change owner of table
alter table table_name
owner to new_owner;

Question: How do we replace a substring via SQL in redshift?
Use the REPLACE function to replace one entire string with another. For example,
if I want to replace a double space with a single space, I can use the following:
select var1,REPLACE(var1, '  ', ' ') as var2,count(*) as cnt
 from data1 
 where REGEXP_REPLACE(var1, '[^0-9a-zA-Z]+', '_')='Oldsmobile_Silhouette' 
 group by 1,2
 order by 1,3 desc
###replace "Shows" in the string with "Theatre"
 replace(catgroup, 'Shows', 'Theatre')
#substitute any character that is non 0-9a-zA-Z by a single "_"
select REGEXP_REPLACE(vehicle_map, '[^0-9a-zA-Z]+', '_') 
##deletes the @ and domain name from email addresses
select regexp_replace( email, '@.*\\.(org|gov|com)$') 
##strip everything up to and including the 3-letter domain (e.g. ".com/") from a url:
regexp_replace(url, '^.*\\.[[:alpha:]]{3}/', '')
##substitute with the corresp. letters:
select translate('mint tea', 'inea', 'osin');
##will output to: most tin

REPLACE is similar to the TRANSLATE Function and the REGEXP_REPLACE Function, except that TRANSLATE makes multiple single-character substitutions and REGEXP_REPLACE lets you search a string for a regular expression pattern, while REPLACE substitutes one entire string with another string.
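Python's built-in str.translate behaves like SQL TRANSLATE: each character in the first argument of str.maketrans maps to the character at the same position in the second. A quick sketch reproducing the 'mint tea' example above:

```python
# i->o, n->s, e->i, a->n, mirroring SQL translate('mint tea','inea','osin')
table = str.maketrans("inea", "osin")
result = "mint tea".translate(table)
print(result)  # most tin
```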

If you need to upload data from an S3 bucket in AWS, you need to create the schema/template for the data first in redshift. Here is code to generate the schema; you can run it in aginity redshift, or use aginity ==> script ==> DDL to query window:
drop table IF EXISTS public.dataset1;
CREATE TABLE public.dataset1
 tran_id BIGINT ENCODE mostly32,
 vin VARCHAR(256) ENCODE lzo,
 sold_dt VARCHAR(256) ENCODE lzo,
 drivetrain VARCHAR(65535) ENCODE lzo
--if you are not sure, use VARCHAR(256) for everything except really long strings
Then you can run the following code in 2 different environments. 1. Continue running in the aginity redshift environment, which will be much faster:
delete from data_simple;
copy data_simple
from 's3://S3_bucket/Transaction_data_simple.csv' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::023456778891:role/RedshifttoS3'
IGNOREHEADER 1      -- handle the csv file header
gzip                -- to handle a gzipped csv file
IGNOREBLANKLINES    -- to handle the blank-lines error
FILLRECORD;         -- to autofill the empty fields in the data
2. In the python environment, either spyder or jupyterlab, to upload a csv or manifest file from S3 to redshift; this seems slower and might have issues.
fs = s3fs.S3FileSystem()
#fill in your own user/password/host/database in the connection string
eng = sa.create_engine('redshift+psycopg2://{user}:{password}@host:5439/dbname'.format(user=user, password=password),
  execution_options={'autocommit': True},
  connect_args={"keepalives": 1, "keepalives_idle": 60,
   "keepalives_interval": 60}
#   isolation_level="AUTOCOMMIT"
  )

#example uploading csv data
#better run this in the spyder environment; in sagemaker/jupyterlab it might not work
eng.execute("""delete from dataname1""")
eng.execute("""copy dataname1
from 's3://S3_bucketlocation/data1.csv' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::#######:role/RedshifttoS3'
GZIP""")  #GZIP for a compressed file in .gz format
#example uploading manifest data                          
eng.execute('DROP TABLE IF EXISTS dataset1')                         
eng.execute("""copy unified_transactions_ext
from 's3://S3_bucketlocation/data_extmanifest' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::1234567:role/RedshifttoS3'
manifest""")
Here is the wrap-up of uploading data onto redshift via: 1. a python dataframe, or 2. a csv file in an S3 bucket.
import psycopg2
from psycopg2.errors import *
import pandas_redshift as pr
import pandas as pd
from boto3 import client

class UploadToRedshift:
    def __init__(self, schema = "public", dbname = "projectz", port = "5439", user = "asdhgig", \
                db_password = "123456789", db_url = "**",\
                aws_access_key_id = "*******", aws_secret_access_key = '*******',\
                transfer_bucket = "data1", 
        aws_iam_role = "arn:aws:iam::*******:role/RedshifttoS3"):
        self.schema = schema
        self.dbname = dbname
        self.port = port
        self.user = user
        # redshift password
        self.password = db_password
        self.host_url = db_url
        # aws iam role which have access to 
        self.aws_iam_role = aws_iam_role
        # Long term aws_access_keys, only work for s3 bucket uv-prod-batch
        self.aws_access_key_id = aws_access_key_id
        self.aws_secret_access_key = aws_secret_access_key
        # Intermediate bucket. Only files in S3 EMR Dynamo DB can be loaded to Redshift,
        # so we need this as transfer hub.
        self.transfer_bucket = transfer_bucket
    def upload_from_s3_to_redshift(self, table_name, file_path,  newtable = False, 
                          query_gen = "", delimiter = ",",  skip_header = 1):
        """Query_gen is used to create new table before upload"""
        '''This method will unload redshift table into S3'''
        conn_string = "dbname='{}' port='{}' user='{}' password='{}' host='{}'"\
            .format(self.dbname, self.port, self.user, self.password, self.host_url)
            con = psycopg2.connect(conn_string)
            print("Connection Successful!")
            print("Unable to connect to Redshift")

        cur = con.cursor()
        # If it's not in redshift, need to create one before upload to redshift
        if newtable:
                print("NEW table created")
            except Exception as e:
                print("Failed to create new table")

        query_copy = """copy {}.{} 
                        from '{}'
                        iam_role '{}'
                        DELIMITER '{}' 
                        IGNOREHEADER {} 
                        .format(self.schema, table_name, file_path, \
                                self.aws_iam_role, delimiter, skip_header)

            print("File {} was uploaded".format(file_path.split("/")[-1]))
        except InternalError_ as e:
            print("Create table " + table + " before copy to it")
        except Exception as e:

    def pd_dtype_to_redshift_dtype(self, dtype):
        # Modified pd_dtype_to_redshift_dtype in pr package to avoid overflow.
        if dtype.startswith('int64'):
            return 'BIGINT'
        elif dtype.startswith('int'):
            return 'INTEGER'
        elif dtype.startswith('float'):
            return 'FLOAT'
        elif dtype.startswith('datetime'):
            return 'TIMESTAMP'
        elif dtype == 'bool':
            return 'BOOLEAN'
        else:
            return 'VARCHAR(256)'

    def upload_pandas_df_to_redshift(self, dataframe, table_name):
        pr.connect_to_redshift(dbname = self.dbname, 
                                user = self.user, 
                                password = self.password, 
                                port = self.port, 
                                host = self.host_url)
        pr.connect_to_s3(aws_access_key_id= self.aws_access_key_id, 
                     aws_secret_access_key= self.aws_secret_access_key, 
                     bucket = self.transfer_bucket, 
                     aws_iam_role = self.aws_iam_role)

        new_dtypes = []
        for dtype in dataframe.dtypes:
            pr.pandas_to_redshift(data_frame = dataframe, redshift_table_name = table_name, 
                                 save_local= False, column_data_types = new_dtypes)
            print("Dataframe was uploaded to redshift as {}".format(table_name))
            conn = client('s3')  
            for key in conn.list_objects(Bucket= self.transfer_bucket)['Contents']:
                if key["Key"].startswith(table_name):
                    conn.delete_object(Bucket = self.transfer_bucket, Key = key['Key'])
            print("Temp file in S3 has been deleted")
        except Exception as e:
            print("ERROR: {}".format(str(e)))

import s3fs

def data_to_s3_csv(s3_location, df):
    fs = s3fs.S3FileSystem()
    with, "wb") as f:
        bt = df.to_csv(None, header=True, index=False).encode()
import importlib
upload_redshift = importlib.import_module('.upload_to_redshift', package='dsv-aws-common')
upload_redshift1 = importlib.import_module('dsv-aws-common.upload_to_redshift')
#upload_redshift is the same as  upload_redshift1

import pandas as pd
#data = pd.read_csv(file1)
upload1 = UploadToRedshift()
upload1.upload_pandas_df_to_redshift(dataframe=all0, table_name = "****_scoring_avg_all")

file1 = os.path.join('s3://uv-prod-****/a_uv_****/output/', '****_scoring_avg_all.csv')

file_path = "s3://s3_bucket_name/data_name.csv"
table = "daily_avg_test"

sql_create = """CREATE TABLE IF NOT EXISTS {}(
                MID BIGINT,
                EDITIONDATE INT,
                AuctionAvgMile FLOAT,
                StdDevMileage FLOAT,
                AvgInsightCond FLOAT,
                StdDevPrice FLOAT,
                BasePrice FLOAT);""".format(table)

upload1.upload_from_s3_to_redshift(table_name = "data1", file_path = "s3://bucket1/data_name.csv",
                                  newtable = False, 
                                  query_gen = sql_create, 
                                  delimiter = ",",  
                                  skip_header = 1)

Question: How do we download data from IBM netezza data warehouse?
import pandas as pd
import pyodbc 
#fill in your own SERVER/DATABASE/UID/PWD; the server name here is masked
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
                      "PORT=5480;DATABASE=dbname;UID=user;PWD=password;")
print("Connection complete")
cursor = conn.cursor()
cursor.execute("SELECT * FROM data_input limit 1000")
colnames=[desc[0] for desc in cursor.description]
data_in = pd.DataFrame.from_records(cursor.fetchall(), columns=colnames)


Question: How do we upload data from python into IBM Netezza data warehouse?
import pandas as pd
import pyodbc 
#fill in your own SERVER/DATABASE/UID/PWD; the server name here is masked
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=***.****.com;"
                      "PORT=5480;DATABASE=dbname;UID=user;PWD=password;")
cursor = conn.cursor()
# create table (columns reconstructed to match the 7 values inserted below)
cursor.execute("CREATE TABLE TEST("
    "C_ID  INT,"
    "C_FP  FLOAT,"
    "C_VARCHAR VARCHAR(40),"
    "C_DATE DATE,"
    "C_TIME TIME,"
    "C_TS TIMESTAMP,"
    "C_BOOL BOOL)")
cursor.execute("INSERT INTO test VALUES(1,1.1,'asdgaing',
      '1901-01-01','23:12:34','1901-01-01 09:00:09','t')")
cursor.execute("INSERT INTO test VALUES(2,3.4,'asdfg2345',
      '1991-11-11','00:00:01','1981-12-31 19:19:19','f')")
cursor.execute("SELECT * FROM TEST")
rows = cursor.fetchall()
for row in rows:
    print(row, end='\n')
cursor.execute("DROP TABLE TEST CASCADE")

Question: How do we upload data from python onto amazon redshift?
1st method: this is relatively easier method, read csv to python, then upload to redshift using pandas_redshift.

import pandas_redshift as pr
pr.connect_to_s3(aws_access_key_id = '***********',
              aws_secret_access_key = '*********',
              bucket = 'bucket_name')
pr.connect_to_redshift(dbname = "your_database_name",
               user = "your_user_name",
               password = "your_pw",
               port = "5439",
               host = "****")

data_type = dict(zip(['object',      'float64', 'int64', 'datetime64[ns]'],
                     ['VARCHAR(255)','FLOAT4',  'INT',   'TIMESTAMP']))
pr.pandas_to_redshift(data_frame = Your_data_in_python,
redshift_table_name = 'New_dataname_in_redshift', column_data_types = data_type) 

2nd method: another way using copy command to upload csv file to redshift:

--Step 1. create the schema via aginity workbench for redshift: 
drop table IF EXISTS public.data_name_in_redshift;
CREATE TABLE public.data_name_in_redshift
  num_var1 NUMERIC(15, 5) ENCODE bytedict,
  num_var2 NUMERIC(15, 5) ENCODE bytedict,
  char_var1 VARCHAR(256) ENCODE lzo

##Step 2. upload the csv file onto the S3 bucket, make sure: 
#a. no header allowed in the csv. 
#b. the order of variables matches with the schema set up. 
##Step 3. run the following in python (run_query is a helper that executes SQL against redshift): 
ut8 = run_query("""delete from data_name_in_redshift""")
ut9 = run_query("""copy data_name_in_redshift
from 's3://folder_location/csv_file_name.csv' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::1234567890:role/RedshifttoS3' """)
--if there are some errors, run the following: 
select * from stl_load_errors limit 1000;

Question: How do we use simple sql on a python dataframe?
Answer: In python you can use a lot of other built-in functions such as .groupby or .pivot_table for summary statistics. They are very flexible; however, the syntax is not that easy to remember, and you will probably need to check the example syntax every time before using them.

Most people are familiar with SQL, either from previous analytic/IT training, or from previous SAS/R SQL experience, so it would be great if we could keep using similar SQL queries in python. We can use pandasql for this purpose.
pip install pandasql ##== run this in anaconda prompt to install (the import below comes from pandasql)==

from pandasql import sqldf
pysql = lambda q: sqldf(q, globals())  
##==========more explanation for globals/locals later=========##

data1=pysql("""select var0,count(*) as cnt,sum(var1)/sum(var2) 
          as pcnt from data0 group by 1""")
##very tricky: if you use pysql in a defined functiton, 
##then need to define data0 as global inside function.           
data2=pysql("""select a.*,b.pcnt from data0 as a 
          inner join data1 as b on a.var0=b.var0""")

##==replace function in sqlite ==##          
SELECT REPLACE('xx is very light, xx is easyn', 'xx', 'SQLite')
SELECT Name, SUBSTR(Name,2,4), SUBSTR(Name,2) FROM Students;

##==select aut_name ending with string 'on'
SELECT aut_name FROM author WHERE LIKE('%on',aut_name)=1;

SELECT aut_name FROM author WHERE LIKE('L_n_on',aut_name)=1;
##==Use 3 double quotes for mutiple lines, or use 1 double quote.          
Notice pysql works with sqlite databases, so you can use a lot of SQL-related functionality, including group by, join, and having statements (inside amazon redshift, you can't use a having statement). However, there are some limitations when you use pysql with sqlite. You can only use the following core functions inside sqlite: avg, count, max, min, sum, abs, random (random() takes no argument), round(12.356,2), substr(string_var,starting_position,num_of_char), instr(orig_string_var,substr_to_find), upper, lower, replace, length, ltrim, rtrim, trim, coalesce.

WHERE LIKE('%on',aut_name)=1; ##return anything with on at the end of the string.
WHERE LIKE('L_n_on',home_city)=1; ##return anything like: London, has to be exact location match.

If you have to use some other functions, like exp(), ln(), then you can use the numpy package. In case if you are interested in using pivot_table or groupby functions to do summaries, here are some examples.
test3 = pd.merge(test1, test2, left_on=['var1','var2'], right_on=['var_a','var_b'],
left_index=False, right_index=False,how='left')

list1 = ['metric_var1','metric_var2']
summary1 = ds[ (ds['var1']>=0) & (ds['var2']==0)]\
    .pivot_table(index='var0', values=list1, aggfunc='sum')

summary2 = ds[(ds['var1']>=date0) & (ds['var2']==0)]\
    .groupby('var0',as_index=False).agg({"var4": 'count'})
#note that groupby will not include any nan values, i.e. groupby output does not display null groups; we need to fillna on the groupby key first.

seg1b=seg1a.groupby('mid', as_index=False).agg({'vin':'count','var2':'mean'})
seg1c=seg1b[(seg1b.vin>=5)] #the column names are mid, vin, var2 (vin holds the count)
t2c=t2b.groupby(['sale_yr'],as_index=False).agg({"ratio_list": [("avg1",'mean'),("median1",'median')]})
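For the functions sqlite lacks, such as exp() and ln(), numpy applies directly to dataframe columns (the column names here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
df["exp_x"] = np.exp(df["x"])  # e**x, element-wise
df["ln_x"] = np.log(df["x"])   # natural log, like SQL ln()
```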

Syntax of if...elif...else:
if test expression:
    Body of if
elif test expression:
    Body of elif
else:
    Body of else
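A small runnable instance of the pattern above (the function name is arbitrary):

```python
def sign(n):
    if n > 0:          # Body of if
        return "positive"
    elif n < 0:        # Body of elif
        return "negative"
    else:              # Body of else
        return "zero"

print(sign(5), sign(-3), sign(0))
```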

Question: Example of making directories in python

import os
cwd = os.getcwd()
cwd1 = os.path.join(cwd,'datain/')
cwd2 = os.path.join(cwd,'dataout/')
cwd3 = os.path.join(cwd,'data_log/')

import os.path
from os import path

for state1 in list_state0:
    state_start = os.path.join(cwd, state1)
    try:
        os.makedirs(state_start)
    except FileExistsError:
        # directory already exists
        pass

Question: What is the code __name__ == "__main__" for?
Python files are called modules and they are identified by the .py file extension. A module can define functions, classes, and variables. So when the interpreter runs a module, the __name__ variable will be set as __main__ if the module that is being run is the main program.

But if the code is importing the module from another module, then the __name__ variable will be set to that module's name. Let's take a look at an example. Create a Python module named and paste this top-level code inside:
# Python file for
print("File one __name__ is set to: {}".format(__name__))

Create a second module named with the same top-level code:
# Python file for
print("File two __name__ is set to: {}".format(__name__))

If we run the first file from the command line: python
output: File one __name__ is set to: __main__

Now revise so it imports the second module:
# revised
import file_two
print("File one __name__ is set to: {}".format(__name__))

If we run this file from the command line: python, the output is:
File two __name__ is set to: file_two
File one __name__ is set to: __main__

There is a really nice use case for the __name__ variable, whether you want a file that can be run as the main program or imported by other modules. We can use an if __name__ == "__main__" block to allow or prevent parts of code from being run when the modules are imported.
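A minimal sketch of that guard (main is an arbitrary name here): the code under the if block only runs when the file is executed directly, while importers can still call main() themselves:

```python
def main():
    # work that should only happen when run as a script
    return "running as the main program"

if __name__ == "__main__":
    # not executed when this module is imported elsewhere
    print(main())
```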

Question: What are mutable and immutable objects in Python?
Simply put, a mutable object can be changed after it is created, and an immutable object can't.
Mutable objects: list, dict, set, bytearray, dataframe
Immutable objects: int, float, complex, string, tuple, bytes

List is a collection which is ordered and changeable. Allows duplicate members.
Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
Set is a collection which is unordered and unindexed. No duplicate members.
Dictionary is a collection which is unordered, changeable and indexed. No duplicate members.

The built-in function id() returns the identity of an object as an integer. This integer usually corresponds to the object's location in memory. Here is an example of a mutable item: dict
dict1 = {}
dict1['key1'] = 'var1'
dict1['key2'] = 'var2'
dict2 = dict1   #dict2 is just another name for the same object

id(dict2) == id(dict1) ##True
#drop an item from dict1
dict1.pop("key2", None) # or dict2.pop("key2", None)    
id(dict2) == id(dict1)  ##still True
print(dict1)  #dropped key2
print(dict2)  #dropped key2
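By contrast, dict.copy() makes a new, independent object, so changes to the original no longer show up in the copy; a quick sketch:

```python
dict1 = {'key1': 'var1', 'key2': 'var2'}
alias = dict1          # just another name for the same object
copy1 = dict1.copy()   # a shallow copy: a brand-new dict

dict1.pop('key2', None)
print(alias)  # {'key1': 'var1'}: key2 is gone here too
print(copy1)  # {'key1': 'var1', 'key2': 'var2'}: the copy is unaffected
```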

Here is an example of getting "first" few items in the dictionary:
list( dict1.items())[:10]
list( dict1.keys())[:10]
list( dict1.values())[:10]

data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
df1 = pd.DataFrame(data1)

    A   B
0   111 121
1   112 122
2   113 123
df2 = df1

A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 affects df2
#df1.drop('A', axis=1, inplace=False) 
#inplace=False will not change df1, df2.
df3=df1.drop('A', axis=1, inplace=False) 
#df3 will drop A, df1 will not change.  
df1.drop('A', axis=1, inplace=True)
df1  #both df1 and df2 will drop A
    B
0   121
1   122
2   123
Here is an example with a mutable item, list:
m = list([1, 2, 3])
n = m

#We created one object of type list; identifiers m and n are tagged to the same list object, which is a collection of 3 immutable int objects.
id(m) == id(n)  ##True

#Popping the last item changes the list object in place,
m.pop()
#but the object id does not change
id(m) == id(n)  ##still True
print(n)  #[1, 2] - n sees the change too

#If you use .copy(), n gets an independent copy and will not change when m does
m = list([1, 2, 3])
n = m.copy()
m.pop()
print(n)  #[1, 2, 3] - unchanged

#another tricky example: 
>>> a = [1, 3, 5, 7]
>>> b = a
>>> b[0] = -10
>>> a
[-10, 3, 5, 7]

#immutable objects: string, int, float, bool
>>> a = "Karim"
>>> b = "Karim"
>>> id(a)
>>> id(b)
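The two id() calls above often print the same number: CPython typically interns identical short string literals, so both names may point at the very same object (an implementation detail, not a language guarantee). Rebinding a string always creates a new object:

```python
a = "Karim"
b = "Karim"
# Identical short string literals are usually interned in CPython.
print(a is b)

a = a + "!"     # strings are immutable: concatenation builds a NEW string
print(b)        # b still refers to the original "Karim"
```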

We can use the .copy() function to make a real copy of the original dataframe, so changing the new dataframe will not change the original one. Here are some examples:
df = pd.DataFrame({'x': [2, 4]})
df_sub = df[0:1]
df_sub.x = -1
print(df)
#You'll get the new value for the first x:
   x
0 -1
1  4

#No value will be changed if you use .copy()
df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1
print(df) # you will get same value as before
When you are changing part of a dataframe, you might get a warning message:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
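One common way to avoid that warning (a sketch): take an explicit .copy() when you want an independent slice, and use .loc on the original when you really mean to modify it.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 4]})

# Independent slice: copy first, then modify freely - df is untouched.
df_sub = df[0:1].copy()
df_sub['x'] = -1

# Intentional in-place change: index the original directly with .loc.
df.loc[0, 'x'] = 99
print(df)
```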

Here is an example of an immutable item: tuple. A tuple () is also ordered, almost the same as a list [], except that a tuple is immutable: you cannot change its value after it is created.
>>> a = ["apples", "bananas", "oranges"]
>>> id(a)
>>> a[0] = "berries"
>>> a
['berries', 'bananas', 'oranges']
>>> id(a)

>>> a = ("apples", "bananas", "oranges") #notice the difference: () vs []
>>> a[0] = "berries"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
#we can't change the value from the tuple

#However, be careful with this tuple - it seems the value got changed? No!
>>> a = ("apples", "bananas", "oranges")
>>> id(a)
>>> a = ("berries", "bananas", "oranges") 
#essentially created a new tuple with new ID in the memory. 
>>> id(a)
Question: What's global/local variable/dataframe/environment in Python?
Before dealing with global vs local, we need to keep in mind that globals in Python are global to a module, not across all modules. (Many people are confused by this, because in, say, C, a global is the same across all implementation files unless you explicitly make it static.) For instance, suppose the main script has the following code:
from math import *
import module1

def wow():
    print(pi)

wow()
and module1.py has the following code:
def cool():
    print(pi)
You will get an error message when you call module1.cool(): NameError: global name 'pi' is not defined, since pi is not defined inside module1 - the from math import * in the main script only puts pi into the main script's namespace.

If you really want to use pi inside module1, you can add this line in the main script:

module1.pi = pi
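The two-file setup can be simulated in one script with types.ModuleType (a hedged sketch for illustration - real code would simply put cool() in its own module1.py file):

```python
import math
import types

# Build a fake "module1" whose function looks up pi in ITS OWN globals.
module1 = types.ModuleType("module1")
exec("def cool():\n    return pi", module1.__dict__)

try:
    module1.cool()            # fails: pi is not defined in module1
except NameError as e:
    print("error:", e)

module1.pi = math.pi          # inject pi into module1's namespace
print(module1.cool())
```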

Week 2: #1: Example of a simple event-driven program
# Import CodeSkulptor GUI module
import simplegui

# Event handler
def tick():
    print "tick!"

# Register handler
timer = simplegui.create_timer(1000, tick)

# Start timer
timer.start()
#2: global vs local examples
# num1 is a global variable
num1 = 1
print num1

# num2 is a local variable
def fun():
    num1 = 2
    num2 = num1 + 1
    print num2

fun()

# the scope of global num1 is the whole program, num1 remains defined
print num1

# the scope of the variable num2 is fun(), num2 is now undefined
print num2

# why use local variables?
# give a descriptive name to a quantity
# avoid computing something multiple times

def fahren_to_kelvin(fahren):
    celsius = 5.0 / 9 * (fahren - 32)
    zero_celsius_in_kelvin = 273.15
    return celsius + zero_celsius_in_kelvin
print fahren_to_kelvin(212)

# the risk/reward of using global variables
# risk - consider the software system for an airliner
# critical piece - flight control system
# non-critical piece - in-flight entertainment system
# both systems might use a variable called "dial"
# we don't want the possibility that changing the volume on your audio
# causes the plane's flaps to change!

# example
num = 4

def fun1():
    global num
    num = 5
def fun2():
    global num
    num = 6

# note that num changes after each call with no obvious explanation
print num
fun1()
print num
fun2()
print num

# global variables are an easy way for event handlers
# to communicate game information.
# safer method - but requires more object-oriented programming techniques
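That "safer method" can be sketched with a small class that owns the state, so each handler mutates one object's attribute instead of a shared module global (the class and method names here are illustrative):

```python
class Game:
    """Keeps handler state on an object instead of in module globals."""
    def __init__(self, num=4):
        self.num = num

    def fun1(self):
        self.num = 5      # only this instance's state changes

    def fun2(self):
        self.num = 6

game = Game()
print(game.num)   # 4
game.fun1()
print(game.num)   # 5
game.fun2()
print(game.num)   # 6
```

Two independent Game objects cannot clobber each other's num, which is exactly the isolation the airliner example asks for.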
#Local vs. Global Variables Example
# For this example, there are five versions of the same program.
# Three of them work and two of them don't. The goal of the program
# is to move an object along a number line using a move method,
# keep track of its location with the variable loc, and calculate
# the displacement (always positive) from its starting location
# with the displacement method.
# Version one: Doesn't work. The move method throws an error,
# while the displacement method always prints 0 because
# the location never changes.
start = 1
loc = start

def move(x):
    loc = loc + x

def displacement():
    return abs(loc - start)

#move(3)  # would throw UnboundLocalError: local variable 'loc' referenced before assignment
print "Version 1:", displacement()
print "Version 1:", displacement()

# Version two: Works. Fixes version one by declaring variables
# global at the start of the move method. No global
# declaration is needed at the start of the displacement
# method because the values in the global variables loc 
# and start are not changing.

start = 1
loc = start

def move(x):
    global loc
    loc = loc + x

def displacement():
    return abs(loc - start)

move(3)
print "Version 2:", displacement()
move(-2)
print "Version 2:", displacement()

# Version three: Also works. This one returns values instead
# of attempting to override the global variable. Notice 
# that the new local variable must have a different name.
# Notice also that we must assign loc to the value returned
# by the move method.

start = 1
loc = start

def move(x):
    pos = loc + x
    return pos

def displacement():
    return abs(loc - start)

loc = move(3)
print "Version 3:", displacement()
loc = move(-2)
print "Version 3:", displacement()

# Version Four: This one does not work. The loc that is 
# a parameter in the move method is actually a local 
# variable instead of a global one, and therefore the 
# value of the global loc does not change.

start = 1
loc = start

def move(x, loc):
    loc += x
    return loc

def displacement():
    return abs(loc - start)

move(3, loc)
print "Version 4:", displacement()
move(-2, loc)
print "Version 4:", displacement()

# Version Five: This one fixes the problem from version 
# four. This one passes the method the value of loc as a 
# parameter, and returns the value of the new loc. Note 
# that in this example the local variable shares the same
# name as the global one, but is not actually the same thing.

start = 1
loc = start

def move(x, loc):
    loc += x
    return loc

def displacement():
    return abs(loc - start)

loc = move(3, loc)
print "Version 5:", displacement()
loc = move(-2, loc)
print "Version 5:", displacement()

# CodeSkulptor runs Python programs in your browser.
# Click the upper left button to run this simple demo.
# CodeSkulptor runs in Chrome 18+, Firefox 11+, and Safari 6+.
# Some features may work in other browsers, but do not expect
# full functionality.  It does NOT run in Internet Explorer.

import simplegui
message = "Welcome!"

# Handler for mouse click

def click():
  global message
  message = "Good job!"

# Handler to draw on canvas

def draw(canvas):
    canvas.draw_text(message, [50, 112], 36, "Red")

# Create a frame and assign callbacks to event handlers
frame = simplegui.create_frame("Home", 300, 200)
frame.add_button("Click me", click)
frame.set_draw_handler(draw)

# Start the frame animation
frame.start()