Tuesday, December 17, 2019

Python study notes 4: error/warning issues and solutions


How do we get rid of NaN and inf values/rows?
spyder is already running, restart kernel failed
Spyder IDE not opening because of problems with spyder.lock
Github error: src refspec remotes/origin/ matches more than one
Github error: Ambiguous object name: 'remotes/origin/'
Python IndentationError: unindent does not match any outer indentation level


Issue: "LinAlgError: SVD did not converge in Linear Least Squares" How do we get rid of NaN and inf values/rows in python dataframe? It will you some trouble when you have NaN and inf values in your dataframe, sometimes you might even get some error message: "LinAlgError: SVD did not converge in Linear Least Squares" when you were running something like lr.fit() in python, try to get more clean dataframe before you run those regression.
#===============================================================
import numpy as np
import pandas as pd
#drop all the inf and NaN rows:
df1=df.replace([np.inf, -np.inf], np.nan).dropna()
#first replace +-inf with nan, then dropna
#Drop the rows where at least one value is missing
df1=df.dropna()
#Drop the columns where at least one value is missing
df1=df.dropna(axis='columns')
#Drop the rows where all values are missing
df1=df.dropna(how='all')
#Keep only the rows with at least 2 non-NA values
df1=df.dropna(thresh=2)
#Define in which columns to look for missing values
df1=df.dropna(subset=['name', 'born'])
#===============================================================
Issue: When creating a day-of-week column in a Pandas dataframe, you might get an error message like: AttributeError: 'Series' object has no attribute 'weekday'.
#===============================================================
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['date'])
df['day-of-week'] = df['date'].weekday()
#AttributeError: 'Series' object has no attribute 'weekday'
#solve this issue using the .dt accessor (weekday/dayofweek are
#properties, not methods, so no parentheses):
df['day-of-week'] = df['date'].dt.weekday
df['day-of-week'] = df['date'].dt.dayofweek
df['day-of-week'] = pd.to_datetime(df['date']).dt.dayofweek
#===============================================================


Error/Issue: "spyder is already running", "restart kernel failed" in spyder, "Spyder IDE not opening because of problems with spyder.lock", "Spyder fails to start because of problems with lockfile", "Spyder Does Not Launch" ?
Answer: if you tried to double click the Spyder icon to open the spyder IDE, nothing happened. What's happening?
There are a few solutions you can try. The 1st solution:
1. Inside the Spyder IDE, open the top menu:
Tools > Preferences > General
2. Click "Advanced Settings" tab,
3. Deactivate the option called
[ ] Use a single instance

The 2nd solution:
1. Go to the following folder:
C:\Users\username\.spyder2 (for Python 2.x)
C:\Users\username\.spyder-py3 (for Python 3.x)
Inside that folder, find spyder.lock and delete it. Then you should be able to open a new Spyder session. The trouble is that spyder.lock gets created again, so every time you want to open yet another new Spyder session, you have to delete it again first.
The 3rd solution:
1. Open the Anaconda Prompt and type: spyder
2. It shows: "spyder is already running. If you want to open a new instance, please pass to it the --new-instance option"
3. Then in the Anaconda Prompt type: spyder --new-instance
Then you can open the Spyder IDE again.
The possible 4th solution:
In the Spyder IDE top menu: Tools -> Reset Spyder to Factory Defaults.
Error/warning: "Boolean Series key will be reindexed to match DataFrame index"
This error/warning usually happens when we try to apply two different filters in the same expression, for example:
df.loc[a_list][df.a_col.isnull()]
There are two conditions/filters we are trying to apply at the same time. The warning comes from the fact that the boolean vector df.a_col.isnull() has the length of df, while df.loc[a_list] has the length of a_list, i.e. it is shorter. Therefore, some indices in df.a_col.isnull() are not in df.loc[a_list]. To get rid of the warning, try one of these 3 solutions:
Solution 1, make the selection of indices in a_list a boolean mask:
df[df.index.isin(a_list) & df.a_col.isnull()] 
Solution 2, do it in two steps:
df2 = df.loc[a_list]
df2[df2.a_col.isnull()] 
Solution 3, if you want a one-liner, use the trick that NaN never equals itself:
df.loc[a_list].query('a_col != a_col')
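Here is a minimal, self-contained sketch (toy data, with a hypothetical column name a_col) that reproduces the warning and applies Solution 1:
#===============================================================
import numpy as np
import pandas as pd

# toy dataframe with a hypothetical column a_col and a subset of indices
df = pd.DataFrame({'a_col': [1.0, np.nan, 3.0, np.nan, 5.0]})
a_list = [0, 1, 2]

# this chained selection triggers the reindex warning:
# df.loc[a_list][df.a_col.isnull()]

# Solution 1: one boolean mask of full length, no warning
print(df[df.index.isin(a_list) & df.a_col.isnull()])
#===============================================================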
Warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead "
Answer: Most likely this is not a big issue. It happens most often when we try to create a new variable or reset the value of an existing variable in a dataframe, for instance,
data[data.var1 == '123456789']['var2'] = 100
We are setting the value 100 on var2 for the records where var1 == '123456789'; for the records where var1 != '123456789', var2 will take the null/NaN value.
To get rid of the warning, use the .loc indexer:
data.loc[data.var1 == '123456789','var2'] = 100
In certain situations you might still get the warning message, for instance:
data1=data[data.var1 == '123456789'] 
data1.loc[304, 'var2'] = 100  
To prevent this warning message, there are 2 options:
1. use the copy() method:
data1=data[data.var1 == '123456789'].copy()
data1.loc[13, 'var2'] = 100  ##reset the row with index label 13##
2. Use the following command to suppress all such warning messages (not highly recommended for beginners; you might want to investigate first to make sure there are no other issues):
pd.set_option('mode.chained_assignment', None)
You can also use the following option to escalate this warning into an error:
pd.set_option('mode.chained_assignment', 'raise')
The default option is
pd.set_option('mode.chained_assignment', 'warn')
Some other Python setup might be helpful during data exploration:
1. How do we display all the column names with values in a dataframe with the head() function? By default, we can see the first 10 variables and the last 10 variables with the first few rows, while the variables in the middle are all truncated. You can use the following setup in pandas:

pd.set_option('display.max_columns', None)
2. Sometimes a variable might have a very long string as its value and the log window can't display the whole text; use the following setup to see the whole text:
##*****************************************##
pd.set_option('display.max_colwidth', -1)
#(in newer pandas versions, use None instead of -1)
##==some other setup for the display:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 150)
pd.set_option('display.float_format', '{:20,.2f}'.format)
##==reset to the defaults==##
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
pd.reset_option('display.float_format')
pd.reset_option('display.max_colwidth')
##*****************************************##
Error issue: When you run the SQL select * from dataset1, you get this error: Unable to cast object of type 'System.Byte[]' to type 'System.IConvertible'. Couldn't store in var_trouble Column. Expected type is Double.
This is most likely a data precision issue, especially when the data is loaded from Python into Redshift: we have to specify the corresponding format in Redshift/S3. To get rid of the error, you can use cast(that_trouble_var as real) or cast(that_trouble_var as float4) so the column is not float/double precision.
Name                                Storage  Range
SMALLINT or INT2                    2 bytes  -32768 to +32767
INTEGER, INT, or INT4               4 bytes  -2147483648 to +2147483647
BIGINT or INT8                      8 bytes  -9223372036854775808 to +9223372036854775807

Name                                Storage  Range
REAL or FLOAT4                      4 bytes  6 significant digits of precision
DOUBLE PRECISION, FLOAT8, or FLOAT  8 bytes  15 significant digits of precision
#=========================================================== 
datatype2=data_set1.dtypes.replace(['object','float64','int64','int32','datetime64[ns]'],['VARCHAR(255)','FLOAT4','BIGINT','INT','TIMESTAMP'])
##don't use float, instead, use float4##
import pandas_redshift as pr
#upload prediction and coefficients                                                                                                                
# Connect to S3
pr.connect_to_s3( aws_access_key_id = '##############',           
                 aws_secret_access_key = '***********************',
                 bucket ='projectzdata-nzref'
               # As of release 1.1.1 you are able to specify an aws_session_token (if necessary):
               # aws_session_token = 
               )
# write to redshift
pr.connect_to_redshift(dbname = "************", 
                user = "########", 
                password = "$$$$$$$$$$", 
                port = "5439", 
                host = "************.redshift.amazonaws.com")
pr.pandas_to_redshift(data_frame = data_set1,
                       redshift_table_name = 'data_set_name_redshift', column_data_types=datatype2)
##*****************************************##

This is related to the classic Floating Point Arithmetic issue:
In floating point, your rounded version is represented by the same underlying number. Since computers are binary, they store floating point numbers as an integer divided by a power of two, so 13.95 will be represented in a fashion similar to 125650429603636838/(2**53).
Double precision (float/float8) numbers have 53 bits (roughly 15-16 decimal digits) of precision, and regular float4 numbers have 24 bits (roughly 7 decimal digits) of precision.
No matter how many base 2 digits you're willing to use, the decimal value 0.1 cannot be represented exactly as a base 2 fraction. In base 2, 1/10 is the infinitely repeating fraction
0.0001100110011001100110011001100110011001100110011...
Stop at any finite number of bits, and you get an approximation. On most machines today, floats are approximated using a binary fraction with the numerator using the first 53 bits starting with the most significant bit and with the denominator as a power of two. In the case of 1/10, the binary fraction is 3602879701896397 / 2 ** 55 which is close to but not exactly equal to the true value of 1/10.
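You can verify this yourself with the standard fractions module; a quick sketch:
#===============================================================
from fractions import Fraction

print(Fraction(0.1))     # 3602879701896397/36028797018963968
print(2 ** 55)           # 36028797018963968, the denominator above
print(0.1 + 0.2 == 0.3)  # False: both sides are approximations
#===============================================================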
As another check point, you can verify this: if you use round(that_trouble_var, 2) you will get the error "Double precision round overflow"; you have to convert that_trouble_var to real or float4 instead of just using float.
Another temporary solution: select var1, var2, var_trouble::double precision from dataset1; in other words, manually change the type. Or simply skip the trouble variables.
Another similar error when you upload data from Python to Redshift (via S3): InternalError: Load into table 'this_table_name' failed. Check 'stl_load_errors' system table for details.
You can run the SQL select * from stl_load_errors from Aginity (or any Redshift client) and check the potential issue behind the corresponding upload error, something like: "Overflow, 1.3365138325104744e+58 (Float valid range 1.175494e-38 to 3.402823e+38)". That means we need the double float format; we can't use the float4 format in the following code:
datatype2=data_set1.dtypes.replace(['object','float64','int64','int32','datetime64[ns]'],['VARCHAR(255)','FLOAT','BIGINT','INT','TIMESTAMP'])
##don't use float4, instead, use float##
Rounding issue in Python version 3.x: round(2.5) outputs 2, and round(1.5) also outputs 2. What's happening?
Answer: Python 3 uses a different rounding behaviour compared to Python 2: so-called "banker's rounding" (round half to even). When a value is exactly halfway between two integers, i.e. ends in *.5, it is rounded to the nearest even integer.
The reason for this is to avoid a bias that appears when all values at .5 are rounded away from zero (and then, e.g., summed).
You can double check with the following output:
##*****************************************##
print(round(6.5))  # 6
print(round(5.5))  # 6
print(round(4.5))  # 4
print(round(3.5))  # 4
print(round(2.5))  # 2
print(round(1.5))  # 2
##*****************************************##
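If you really want the old Python-2-style round-half-away-from-zero behaviour, one option is the standard decimal module; a minimal sketch (round_half_up is a hypothetical helper name):
#===============================================================
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x, digits=0):
    # quantize with ROUND_HALF_UP reproduces the classic behaviour
    exp = Decimal(1).scaleb(-digits)  # e.g. Decimal('0.01') for digits=2
    return float(Decimal(str(x)).quantize(exp, rounding=ROUND_HALF_UP))

print(round_half_up(2.5))  # 3.0, whereas round(2.5) == 2
print(round_half_up(1.5))  # 2.0
#===============================================================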
Error/issue: TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'
Answer: The easier solution is to convert the value of type decimal.Decimal to float:
df1['new_var'] = df1['old_var'].astype(float)
df1['new_var'] = pd.to_numeric(df1['old_var'], errors='coerce',downcast='float')

Don't try to go the other way and convert the float to decimal.Decimal.
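A quick reproduction of the error and the fix on a toy column (old_var/new_var are hypothetical names):
#===============================================================
import pandas as pd
from decimal import Decimal

df1 = pd.DataFrame({'old_var': [Decimal('1.10'), Decimal('2.25')]})
# df1['old_var'] - 0.5  # TypeError: unsupported operand type(s) for -
df1['new_var'] = df1['old_var'].astype(float) - 0.5  # works after the cast
print(df1)
#===============================================================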

Issue: When you try to calculate the inverse of a matrix you get, for instance, "linalg.inv() singular matrix"?

It's probably due to one of the following issues:
1. Two columns are perfectly correlated with each other; check with corr().
2. Check for any NaN in the data; if one column is all NaN, it will most likely cause trouble.
3. Use the pseudo-inverse to approximate the matrix inverse: np.linalg.pinv(a), as in the sketch below.
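For point 3, a minimal sketch with a deliberately singular matrix:
#===============================================================
import numpy as np

# singular matrix: the second row is exactly 2x the first
a = np.array([[1.0, 2.0],
              [2.0, 4.0]])
# np.linalg.inv(a)  # raises LinAlgError: Singular matrix
print(np.linalg.pinv(a))  # Moore-Penrose pseudo-inverse always exists
#===============================================================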

Issue: Redshift error: "Missing data for not-null field"

This usually happens in AWS Redshift SQL due to some null or empty cells in certain columns. I would recommend using a filter like:
where substring(TRIM(var1),1,1) is not null
Error/issue: Don't use conda install -c conda-forge keras to install keras,
otherwise you might get the following errors when opening the Anaconda Prompt:
#===============================================================
C:\Users\Aaron>python C:\Users\Aaron\Anaconda3\etc\keras\load_config.py  1>temp.txt
C:\Users\Aaron>set /p KERAS_BACKEND=<temp.txt
C:\Users\Aaron>del temp.txt
C:\Users\Aaron>python -c "import keras"  1>nul 2>&1
#===============================================================
then "anaconda Prompt" will disappear, in other words, you can't use "anaconda Prompt" anymore! How do we fix this issue?

There are several solutions provided on Stack Overflow, but they all have issues and don't really work, including:

1. Go to %UserProfile%\Anaconda3\etc\conda\activate.d, right click on keras_activate.bat, click Edit, and change:
python -c "import keras" 1>nul 2>&1
to this:

python -c "import keras" 1> 2>&1
However, while this seems to work the first time, the next time you run it you might have the same issue again. Not working!

2. Temporary solution: right before the script finishes running, escape it using Ctrl+D, and you should be able to do whatever you want after that.

Here is the best solution: simply don't install keras with conda install -c conda-forge keras; instead, use the following in the Anaconda Prompt:

#===============================================================
pip install tensorflow
pip install keras
#===============================================================
Then everything will work just fine! Isn't that awesome!?


Error/issue:
How do we create .py python file in Jupyterlab interface?

If you love to work with Python in the Spyder interface, you probably don't want to use Jupyter Notebook. So how do we create a .py file in the JupyterLab environment?
Once you have the .py file, you can right click it and choose "Create Console for Editor". You can then interactively run sections of the code using Shift+Enter.

Method 1: use the following indirect approach to create a .py file in JupyterLab: go to "File > New > Text File", then rename it from untitled.txt to foo.py, or run !touch foo.py in the console.

Method 2: run the following script in the terminal in JupyterLab (thanks to jtpio): jupyter labextension install jupyterlab-python-file

Then you can do this in JupyterLab: click File --> New --> Python File (this is the new button you will see). You might want to restart the kernel after running the previous script, otherwise you might not see the new "Python File" button right away.

ValueError: ValueError: cannot reindex from a duplicate axis

When you try to assign a Series to a column of a dataframe, you might get this message: ValueError: cannot reindex from a duplicate axis.
#===============================================================
df1['tranid']=tranid         #raises the error
df1['tranid']=tranid.values  #the error goes away
#===============================================================

ValueError: Wrong number of items passed 2, placement implies 1
Solution: this is usually due to a size mismatch in a Pandas dataframe operation. You can use the .shape attribute to double-check the size of both sides. For example, if you have duplicated column names, any operation that selects the duplicated column will raise this error, as in the sketch below.
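A minimal sketch (toy data) that reproduces the error with a duplicated column name and then fixes it:
#===============================================================
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'a'])  # duplicated name
print(df.shape)          # (2, 2): check both sides' sizes first
# df['b'] = df['a'] + 1  # ValueError (in older pandas: "Wrong number of
#                        # items passed 2, placement implies 1")
df.columns = ['a', 'a2'] # de-duplicate the column names
df['b'] = df['a'] + 1    # now the assignment works
#===============================================================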

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() ?
Answer: This usually happens when you use the logical operators and, or on a Series. Try the following solutions:

1. use the symbol & instead of the word "and", and the symbol | instead of the word "or"
2. add parentheses () around each condition/filter
3(?). if you are using something like if (a==1) & (dt.var1==2), you might still get the error message; you need to change it to:
if (a==1) & (dt.var1==2).bool()
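A toy sketch showing the difference between filtering with &/| and reducing a Series to a single boolean for an if-test:
#===============================================================
import pandas as pd

dt = pd.DataFrame({'var1': [2, 3]})
# if dt.var1 == 2: ...  # ValueError: truth value of a Series is ambiguous
print(dt[(dt.var1 == 2) | (dt.var1 == 3)])  # & and | with parentheses
if (dt.var1 == 2).any():  # .any()/.all() collapse the Series to one boolean
    print('at least one row matches')
#===============================================================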

When you run a division you might get an error like ZeroDivisionError: float division by zero; then you can try this:
sum4['var1']=sum4.var2/(sum4[sum4['var3']>=1]['var3'])
The other approach, max(1, sum4.var3), does not work (the truth value of a Series is ambiguous); see the sketch below for an alternative.
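One hedged alternative (toy data, hypothetical column names): NaN-out or clip the zero denominators instead of calling max() on a Series:
#===============================================================
import numpy as np
import pandas as pd

sum4 = pd.DataFrame({'var2': [10.0, 20.0], 'var3': [0.0, 4.0]})
# rows where var3 == 0 get NaN instead of raising/propagating an error
sum4['var1'] = sum4['var2'] / sum4['var3'].replace(0, np.nan)
# or force the denominator to be at least 1 (Series version of max(1, x)):
sum4['var1b'] = sum4['var2'] / sum4['var3'].clip(lower=1)
print(sum4)
#===============================================================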


Error/issue:
No module named 'pandas.core.internals.managers';
or 'pandas.core.internals' is not a package;
or 'Series' object has no attribute 'notna'.


Answer: Most of the time you can solve this issue (including a kernel restart getting stuck) by running the following in the Anaconda Prompt: conda update --all.

If it's in Jupyter Notebook:
this is most likely due to a different pandas version; use
pip3 install pandas==0.24.1, or update the pandas package via the following code in Jupyter Notebook:
!pip install --upgrade pip
!pip install --upgrade pandas
Then restart the kernel in Jupyter Notebook and the error should go away.
To install packages that are isolated to the current user, use the --user flag.



Error/Issue:
Python IndentationError: unindent does not match any outer indentation level.


Answer: here is an example of code that produces such an error:
if True:
    return 1
if False:
     return 2
You can see that the indentation of the two return statements is clearly different. Sometimes the difference is not that easy and obvious to see.
In that case, highlight/select the code and compare it carefully; most of the time you will see a small marker like "-->" (a tab) on one line and nothing on the following line, which is what makes the indentation differ. This usually happens when one line was typed and another copied, mixing tabs and spaces.

If you still can't find and fix the error, copy the code into the Windows Notepad editor; there you should see that the lines are not aligned.

Error/Issue: when running from pandas_datareader import DataReader, this error message comes out: "ImportError: cannot import name 'is_list_like' from pandas.core.common"

Answer: This is due to a version mismatch between pandas and pandas_datareader: pandas_datareader is compatible with earlier versions of pandas, but not with the newer versions, where is_list_like has moved to pandas.api.types.

One fix is to edit the fred.py file inside pandas_datareader and replace from pandas.core.common import is_list_like with from pandas.api.types import is_list_like; that works.
Another quick solution: place the following code right before from pandas_datareader import DataReader:
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like


How do we change the theme in Jupyter Notebook?
Answer: Most people know how to change the theme in Spyder by point-and-click. In Jupyter Notebook we need to install a package:
!pip install --user -U jupyterthemes
from jupyterthemes import get_themes
import jupyterthemes as jt
from jupyterthemes.stylefx import set_nb_theme
# uncomment and execute line to try a new theme
#set_nb_theme('onedork')
#set_nb_theme('chesterish')
#set_nb_theme('grade3')
#set_nb_theme('oceans16')
#set_nb_theme('solarizedl')
#set_nb_theme('solarizedd')
set_nb_theme('monokai')

What is papermill?
Answer: papermill executes Jupyter notebooks, e.g., in a production environment. Execute from the Python API:
import papermill as pm
pm.execute_notebook(
   'path/to/input.ipynb',
   'path/to/output.ipynb',
   parameters=dict(alpha=0.6, ratio=0.1)
)
or execute from the command line (use -r to pass parameters as raw strings):
$ papermill local/input.ipynb s3://bkt/output.ipynb -r version 1.0
$ papermill local/input.ipynb s3://bkt/output.ipynb -f parameters.yaml
$ AWS_PROFILE=dev_account papermill local/input.ipynb 
      s3://bkt/output.ipynb -p alpha 0.6 -p l1_ratio 0.1

What does the "-U" option stand for in pip install -U?
Answer: Type pip install -h to list help:
-U, --upgrade Upgrade all packages to the newest available version.
So, if you already have a package installed, it will upgrade the package for you. Without the -U switch it'll tell you the package is already installed and exit.
Each pip subcommand has its own help listing. pip -h shows you overall help, and pip [subcommand] -h gives you help for that sub command, such as install.

Issue: ValueError: Can only compare identically-labeled Series objects
Answer: The following is an example that reproduces the error message.
#===============================================================
#zipcode_=filec[['zip5']].drop_duplicates(subset=['zip5']).zip5
#the error shows up when running the above, but not with the following:
zipcode_=filec[['zip5']].drop_duplicates(subset=['zip5']).zip5[0]
y0=z0[(z0.zip5==zipcode_)]
#===============================================================
The key is to compare against a scalar element extracted from the dataframe, not against the dataframe/Series itself.

Issue: How do we convert a string to a dataframe name? Or how do we convert a dataframe name to a string? How do we use a dataframe's name as a string?
Answer: Sometimes we want to do something like this:
#===============================================================
for i in range(5):
   data=data_i   #pseudocode: this does NOT substitute i into the name
   ...
#or something more complex:
for stocks in apple_stock_list:
    for feature in features:
        apple_stock[str(stocks) + "_" + feature] = stocks[feature]
#===============================================================
There seems to be no good direct answer to this question. If you try str(stocks), you will most likely not get what you want. However, we can solve this indirectly.
#===============================================================
for da in [data_make,data_model,data_trim]:
   data=da
   data['type']=data.name
...
#where you have defined previously:
data_make['name']='make'
data_model['name']='model'
#===============================================================
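A more robust pattern, in my opinion, is to keep the dataframes in a dict keyed by name, so the name-to-object mapping is explicit (a sketch with hypothetical toy frames standing in for data_make/data_model/data_trim):
#===============================================================
import pandas as pd

frames = {'make':  pd.DataFrame({'x': [1]}),
          'model': pd.DataFrame({'x': [2]}),
          'trim':  pd.DataFrame({'x': [3]})}

for name, data in frames.items():
    data['type'] = name  # the dict key doubles as the "dataframe name"
    print(name, data.shape)
#===============================================================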

Issue: "ValueError: cannot reindex from a duplicate axis"
Anwser: The folowing is an example to duplicate the error message.
#===============================================================
wind = pd.DataFrame({'DATE (MM/DD/YYYY)': ['2018-01-01', '2018-02-01', '2018-03-01']})
temp = pd.DataFrame({'stamp': ['1', '2', '3']}, index=[0, 1, 1])
# ATTEMPT 1: FAIL
wind['timestamp'] = wind['DATE (MM/DD/YYYY)'] + ' ' + temp['stamp']
# ValueError: cannot reindex from a duplicate axis
# ATTEMPT 2: SUCCESS
wind = wind.reset_index(drop=True)
temp = temp.reset_index(drop=True)
wind['timestamp'] = wind['DATE (MM/DD/YYYY)'] + ' ' + temp['stamp']
print(wind)
  DATE (MM/DD/YYYY)     timestamp
0        2018-01-01  2018-01-01 1
1        2018-02-01  2018-02-01 2
2        2018-03-01  2018-03-01 3
#===============================================================

To fix the issue, first check the duplicated index via df[df.index.duplicated()], then remove the duplicated index: df = df[~df.index.duplicated()], or use temp['stamp'].values instead of temp['stamp'].
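A minimal sketch of those index-dedup fixes, reusing the temp frame from the example above:
#===============================================================
import pandas as pd

temp = pd.DataFrame({'stamp': ['1', '2', '3']}, index=[0, 1, 1])
print(temp[temp.index.duplicated()])   # shows the offending row(s)
temp = temp[~temp.index.duplicated()]  # keeps the first occurrence
print(temp)
#===============================================================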
