Sunday, June 6, 2021

Python Study notes: how do we use Underscore(_) in Python

You will find five main uses of underscore(_). Once you have an idea about underscore(_), you can also use it for your own purposes.

  1. Use In Interpreter
  2. Ignoring Values
  3. Use In Looping
  4. Separating Digits Of Numbers
  5. Naming

    5.1. Single Pre Underscore

    5.2. Single Post Underscore

    5.3. Double Pre Underscores

    5.4. Double Pre And Post Underscores

Let's see all the uses briefly with examples.

1. Use In Interpreter

Python automatically stores the value of the last expression evaluated in the interactive interpreter in a special variable called "_". You can assign this value to another variable if you want, and you can use it like a normal variable. See the example:
>>> 5 + 4
9
>>> _     # stores the result of the above expression
9
>>> _ + 6
15
>>> _
15
>>> a = _  # assigning the value of _ to another variable
>>> a
15

2. Ignoring Values

Underscore(_) is also used to ignore values. If you don't want to use a specific value while unpacking, just assign that value to underscore(_).

Ignoring means assigning the value to the special variable underscore(_), with the understanding that it won't be used in later code.

See the example

## ignoring a value
a, _, b = (1, 2, 3) # a = 1, b = 3
print(a, b)  ## output: 1 3

## ignoring multiple values
## *variable collects the remaining values into a list while unpacking
## it's called "Extended Unpacking", only available in Python 3.x
a, *_, b = (7, 6, 5, 4, 3, 2, 1)
print(a, b)  ## output: 7 1

3. Use In Looping

You can use underscore(_) as a variable in looping. See the examples below to get an idea.

## looping five times using _
for _ in range(5):
    print(_, end=' ')
## output: 0 1 2 3 4

## iterating over a list using _
## you can use _ the same as a normal variable
languages = ["Python", "JS", "PHP", "Java"]
for _ in languages:
    print(_)

_ = 5
while _ < 10:
    print(_, end = ' ') # default value of 'end' is '\n' in python. we're changing it to a space
    _ += 1
## output: 5 6 7 8 9

4. Separating Digits Of Numbers

If you have a number with many digits, you can separate groups of digits with underscore(_) for better readability.

Ex:- million = 1_000_000

You can also use underscore(_) to group the digits of binary, octal or hex literals.

Ex:- binary = 0b_0010, octa = 0o_64, hexa = 0x_23_ab

Execute all the above examples to see the results.

## different number systems
## you can also check whether they are correct by converting them into integers using the int() function
million = 1_000_000
binary = 0b_0010
octa = 0o_64
hexa = 0x_23_ab

print(million)  ##output: 1000000
print(binary)  ##output:2
print(octa)  ##output: 52
print(hexa) ##output 9131

5. Naming Using Underscore(_)

Underscore(_) can be used to name variables, functions and classes, etc..,

  • Single Pre Underscore:- _variable
  • Single Post Underscore:- variable_
  • Double Pre Underscores:- __variable
  • Double Pre and Post Underscores:- __variable__

5.1. _single_pre_underscore


A single pre underscore marks a name as intended for internal use. It is only a convention; Python does not enforce it.

See the following example.

class Test:

    def __init__(self): = "datacamp"   # the attribute name here is a guess; it was lost in the original
        self._num = 7

obj = Test()
print(     ##output: datacamp
print(obj._num)   ##output: 7

A single pre underscore doesn't stop you from accessing the variable from outside the class.

But a single pre underscore does affect the names that are imported from a module.

Let's write the following code in the file.

## filename:-

def func():
    return "datacamp"

def _private_func():
    return 7

Now, if you import all the methods and names from with a wildcard import, Python doesn't import the names which start with a single pre underscore (unless the module defines an __all__ list that includes them).

>>> from my_functions import *
>>> func()
'datacamp'
>>> _private_func()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name '_private_func' is not defined
You can avoid the above error by importing the module normally:
>>> import my_functions
>>> my_functions.func()
'datacamp'
>>> my_functions._private_func()
7

A single pre underscore is only a convention marking a name as meant for internal use.

5.2. single_post_underscore


Sometimes you want to use a Python keyword as a variable, function or class name; this convention lets you do that.

You can avoid conflicts with the Python Keywords by adding an underscore at the end of the name which you want to use.

Let's see the example.

>>> def function(class):
  File "<stdin>", line 1
    def function(class):
SyntaxError: invalid syntax
>>> def function(class_):
...     pass

A single post underscore lets you name your variables after Python keywords, avoiding the clash by adding an underscore at the end of the name.

5.3. Double Pre Underscore


Double Pre Underscores are used for name mangling.

A double pre underscore tells the Python interpreter to rewrite the attribute name (to _ClassName__attribute) in order to avoid naming conflicts in subclasses.

  • Name Mangling:- the Python interpreter alters the variable name in a way that makes it hard for subclasses to accidentally clash with it.

It is a bit more involved than the other naming conventions.
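The mechanics are simple to see, though. A minimal sketch of what name mangling does (the class and attribute names here are made up for illustration):

```python
class Parent:
    def __init__(self):
        self.__secret = "parent"   # stored on the instance as _Parent__secret

class Child(Parent):
    def __init__(self):
        super().__init__()
        self.__secret = "child"    # stored as _Child__secret, so no clash

obj = Child()
print(obj._Parent__secret)  # parent
print(obj._Child__secret)   # child
```

Inside the class you still write self.__secret; only the externally visible attribute name is rewritten to _ClassName__secret.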

5.4. Double Pre And Post Underscores


In Python, you will find different names which start and end with double underscores. They are called magic methods or dunder methods.

class Sample():

    def __init__(self):
        self.__num__ = 7

obj = Sample()

Using dunder-style names for your own variables risks clashes with Python's current or future magic methods, so it's better to stay away from them.
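For illustration, here is how existing dunder names are wired into the language, which is exactly why defining your own is risky:

```python
nums = [1, 2, 3]

# built-in functions and operators dispatch to dunder methods
print(len(nums))       # calls nums.__len__()
print(nums.__len__())  # 3
print((2).__add__(3))  # what the expression 2 + 3 invokes: 5
```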

Saturday, May 1, 2021

NLP study notes:

Word embedding: a collective term for models that learn to map words or phrases in a vocabulary to vectors of numerical values. Neural networks are designed to learn from numerical data, and word embedding improves their ability to learn from text by representing the data as lower-dimensional vectors, called embeddings. The technique reduces the dimensionality of text data, and these models can also learn interesting traits about the words in a vocabulary.

The general baseline for dealing with words in text data is to one-hot encode your text. With tens of thousands of unique words in a vocabulary, computations with one-hot encoded vectors are very inefficient because most values in a one-hot vector are 0, so the matrix calculation between a one-hot vector and the first hidden layer results in an output that is mostly 0. We use embeddings to solve this problem and greatly improve the efficiency of the network. An embedding is just like a fully-connected layer; we call this layer the embedding layer and its weights the embedding weights.

So we use this weight matrix as a lookup table. We encode the words as integers, for example 'cool' is encoded as 512 and 'hot' as 764. Then to get the hidden layer output value for 'cool', we simply look up the 512th row in the weight matrix. This process is called embedding lookup, and the number of dimensions in the hidden layer output is the embedding dimension.

To reiterate:
a) The embedding layer is just a hidden layer.
b) The lookup table is just the embedding weight matrix.
c) The lookup is just a shortcut for the matrix multiplication.
d) The lookup table is trained just like any weight matrix.

Popular off-the-shelf word embedding models in use today: Word2Vec (by Google), GloVe (by Stanford), fastText (by Facebook).

Word2Vec: this model is provided by Google and is trained on Google News data. It has 300 dimensions and is trained on 3 million words from the Google News data. The team used skip-gram and negative sampling to build this model. It was released in 2013.

GloVe: Global Vectors for word representation (GloVe) is provided by Stanford. They provide various models from 25, 50, 100 and 200 up to 300 dimensions, based on 2, 6, 42 and 840 billion tokens. The team used word-to-word co-occurrence to build this model: if two words co-occur many times, they likely have some linguistic or semantic similarity.

fastText: this model is developed by Facebook. They provide 3 models with 300 dimensions each. fastText achieves good performance for word representations and sentence classification because it makes use of character-level representations: each word is represented as a bag of character n-grams in addition to the word itself. For example, for the word partial with n=3, the fastText character n-grams are <pa, par, art, rti, tia, ial, al>, where the angle brackets < and > are added as boundary symbols to separate the n-grams from the word itself.
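The "lookup is just a shortcut for matrix multiplication" point can be checked directly with numpy. The vocabulary size, embedding dimension, and the integer code 512 for 'cool' are just the example numbers from above:

```python
import numpy as np

vocab_size, embed_dim = 1000, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))  # embedding weight matrix

idx = 512                      # the example integer code for 'cool'
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0

via_matmul = one_hot @ W       # one-hot vector times the weight matrix...
via_lookup = W[idx]            # ...equals simply reading row 512

assert np.allclose(via_matmul, via_lookup)
```

This is why embedding layers skip the matmul entirely and just index into the weight matrix.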

Friday, March 26, 2021

Python Study notes: How to run Scala and Spark in the Jupyter notebook

Here is a short tutorial on running Scala in the Jupyter notebook using spylon-kernel, which can also be used for Scala development. It is an additional kernel that has to be installed separately.
## Prerequisites
* Apache Spark 2.1.1 compiled for Scala 2.11
* Jupyter Notebook
* Python 3.5+

Step1: install the package using `pip` or `conda`

pip install spylon-kernel
# or
conda install -c conda-forge spylon-kernel

Step2: create a kernel spec
This will allow us to select the scala kernel in the notebook.
python -m spylon_kernel install

Step3: start the jupyter notebook
jupyter notebook

Step4:  in the notebook we select 
New -> spylon-kernel 
#This will start our scala kernel.

Step5: testing the notebook
val x = 2
val y = 3

Test: use python (spylon-kernel also provides a %%python cell magic):
%%python
x = 2
print(x)

Test: we can even use spark to create a dataset:
val data = Seq((1,2,3), (4,5,6), (6,7,8), (9,19,10))
val ds = spark.createDataset(data)

Tuesday, March 16, 2021

Data Science Study Notes: Neo4j for beginners

What's Neo4j?

Neo4j is a database: a graph database, specifically a property graph database.
What's the difference between a traditional relational database and Neo4j?
How do we query a graph database? We can't use the traditional SQL relational tools; instead, we use a graph query language called Cypher.
Some examples of Cypher code: after installing the desktop app, open the "example project" (movie database) in the browser, and run the code after the dollar sign:

// Get some sample data to start: 
MATCH (n1)-[r]->(n2) RETURN r, n1, n2 LIMIT 25

//use the following exercise to demo:

//Retrieve all Movie nodes that have a released property value of 2003.
MATCH (m:Movie {released:2003}) RETURN m

// Count all nodes
MATCH (n) RETURN count(n)

// List relationship types
CALL db.relationshipTypes()

// Count all relationships
MATCH ()-->() RETURN count(*);

// Hello World! create a database if needed:
CREATE (database:Database {name:"Neo4j"})-[r:SAYS]->(message:Message {name:"Hello World!"}) 
RETURN database, message, r
The browser guide also includes examples of exploration analytics, pattern matching, and removing duplicates.

Another data visualization tool: Cytoscape, an open tool for displaying complex graph networks.

Saturday, February 13, 2021

Data Science Study Notes: reinforcement learning

Terminology: state vs action vs policy vs reward vs state transition. The policy function is a probability density function (PDF); a policy network uses a neural network to approximate the policy function.
Policy-based reinforcement learning:
Policy-based reinforcement learning algorithm:
Policy-based method: if a good policy function π is known, the agent can be controlled by the policy: randomly sample a_t from the policy function π. In reality, however, we don't know π; it is exactly what we are trying to learn. So we approximate the policy function π with a policy network, which is where deep neural networks (convolutional/dense layers) come into play. When searching for the best policy, we use the policy gradient algorithm to maximize the expectation of the reward.
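As a toy illustration of the policy gradient idea (not the full actor-critic setup in these notes, just plain REINFORCE on a hypothetical two-armed bandit, with a softmax over two logits standing in for the policy network π):

```python
import numpy as np

# Two-armed bandit: arm 1 pays more on average (made-up numbers).
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])

theta = np.zeros(2)  # stand-in for the policy network: one logit per action

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

lr = 0.1
for _ in range(2000):
    p = policy(theta)
    a = rng.choice(2, p=p)              # sample a_t from pi
    r = rng.normal(true_means[a], 0.1)  # observe reward
    grad_log = -p                       # gradient of log pi(a): one-hot(a) - p
    grad_log[a] += 1.0
    theta = theta + lr * r * grad_log   # gradient ascent on expected reward

print(policy(theta))
```

After training, most of the probability mass sits on the higher-reward arm, which is what maximizing the expected reward means here.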

The relationship:
Actor = athlete (sports player). Critic = referee. How do we train the player to become a champion? The athlete needs a lot of practice and training, but how does the player know he is getting better? Through the referee, who gives immediate feedback.

However, the referees themselves may not know at first which actions are best, so they also need to be trained to tell good, high-performing actions from bad ones. After we have trained the value network (critic) for the referee, we use that network to train the player.
How do we train the athlete (the policy network = actor):
How do we train the referee/critic together:
Train two networks: policy network(player) and value network(critic), what's happening during the training and what's after:
Train the network -1:
Train the network -2:
Train the network -3:
Value-based reinforcement learning:
Gym is a toolkit for developing and comparing reinforcement learning algorithms. Classic control problems: CartPole, Pendulum; MuJoCo offers continuous control tasks (e.g. making a Humanoid walk continuously).

The tutorial is based on the reinforcement learning lectures of Dr. Wang.

Monday, February 1, 2021

GCP Study notes: Biquery API SQL examples

Use bigquery to download/upload data from/to pyspark:
from pyspark.sql import SparkSession
from util.BigqueryHandler import BigqueryHandler

spark = SparkSession.builder.getOrCreate()

project_id = 'project_id_name'
tmp_gcs_bucket = 'bucket_name'
bq = BigqueryHandler(project_id=project_id, spark_session=spark, tmp_gcs_bucket=tmp_gcs_bucket, sa=sa_file)

# whatever dataset you want
dataset = 'database1'

# whatever var name and table you want - returns a spark dataframe
data1 = bq.read_bq_table_spark(dataset=dataset, table='table_name1')

# data1 is your spark dataframe
bq.write_bq_table_spark(df=data1, dataset=data_schema2, table='table_name2')

#Save the previous code as a .py file, then in the command line run:  
spark-submit --py-files --jars spark-bigquery-with-dependencies_2.11-0.17.2.jar

#if it requires special service account permission: 
spark-submit --py-files --jars spark-bigquery-with-dependencies_2.11-0.17.2.jar \
             --files piq_service.json

#here is another direct way to load/save data:
df ='bigquery')\
          .option('table', '{pid}:{ds}.{tbl}'.format(pid=project_id, ds=dataset, tbl=table))\
          .option('maxParallelism', '56')\
          .option('viewMaterializationDataset', 'database1')\
          .option('viewsEnabled', 'True')\
          .load()
#if needed, also add: .option('parentProject', project_id) and .option('credentialsFile', sa_file)

#Reading from views is disabled by default. In order to enable it, 
#either set the viewsEnabled option when reading the specific view (.option("viewsEnabled", "true")) 
#or set it globally by calling spark.conf.set("viewsEnabled", "true").

#you might get error message for the no permission to bigquery.create 
#    when you access the views from the tables. 
#BigQuery views are not materialized by default, which means that 
#    the connector needs to materialize them before it can read them.
#so you need to add the code: .option('viewMaterializationDataset', 'database1')\

#By default, the materialized views are created in the same project and dataset. 
#Those can be configured by: viewMaterializationProject and viewMaterializationDataset options. 
#These options can also be globally set by calling spark.conf.set(...) before reading the views.

df1=spark.sql(''' select var1,var2, case when ... from df where ...''')
#define a function to load data.
def read_bq_table_spark(dataset, table):
    df ='bigquery')\
                .option('table', '{pid}:{ds}.{tbl}'.format(pid=project_id, ds=dataset, tbl=table))\
                .option('maxParallelism', '56')\
                .option('viewsEnabled', 'True')\
                .option('viewMaterializationDataset', 'stg_property_panoramiq')\
                .option('parentProject', project_id)\
                .load()
    return df
#use sql to select subset in bigquery connector, need updated version for bigquery connector         
#run the following from the  cluster command line: 
pyspark --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.20.0.jar

#if you are using an older version (0.17.3), you might get this error message: 
#Caused by:; 
#local class incompatible: stream classdesc serialVersionUID = -3988734315685039601,
#then you need to go to the github repo, download the newer version of the jar, and upload it to the server. 

#spark.conf.set("viewMaterializationDataset", 'stg_property_panoramiq')
#spark.conf.set("materializationDataset", 'stg_property_panoramiq')

sql = """
  SELECT tag, COUNT(*) c
  FROM (
    SELECT SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date)>=2014
  ), UNNEST(tags) tag
  GROUP BY tag
  LIMIT 10
"""
df ="bigquery")\
.option('maxParallelism', '56')\
.option('viewsEnabled', 'True')\
.option('viewMaterializationDataset', 'stg_property_panoramiq')\
.option('parentProject', project_id)\
.option("query", sql)\
.load()

# or, more concisely, with just the query option:

df ="bigquery").option("query", sql).load()

#define a function to save data back to bigquery.
def write_bq_table_spark(df, dataset, table):
    df.write.format('bigquery')\
            .option('parentProject', project_id)\
            .option('table', '{pid}:{ds}.{tbl}'.format(pid=project_id, ds=dataset, tbl=table))\
            .option('temporaryGcsBucket', tmp_gcs_bucket)\
            .mode('overwrite')\
            .save()
#note we are trying to upload a pyspark dataframe to bigquery, 
#if you have a hive data, you might need to load the hive table first: 

temp=spark.sql('''select * from {hive_schema}.{table_name} '''.format(hive_schema=schema1,table_name='table1'))
bq.write_bq_table_spark(df=temp, dataset=schema1, table='table1')

If you are using GCP bigquery to prepare the dataset, here are some examples:

#cumulative summation example in bigquery SQL: 
#1. group by each layer, also count the total.
#2. cumulative summation for each layer.

drop table if exists `prodjectid1.schema1.data_OCSD1`;
create table `prodjectid1.schema1.data_OCSD1` as
SELECT a.cntycd,number_of_stories,b.cnt_cnty,count(*) as cnt  FROM `prodjectid1.schema1.data_OCSD0` as a 

left join ( SELECT cntycd,count(*) as cnt_cnty  FROM `prodjectid1.schema1.data_OCSD10`
           where  living_area_all_buildings<3000 group by cntycd ) as b

on a.cntycd=b.cntycd
where  living_area_all_buildings<3000
group by cntycd,number_of_stories,b.cnt_cnty
order by cntycd,number_of_stories,cnt desc;

#with statement to join with multiple SQL: 
with b as (select situscensid,count(*) as cnt from `project_id1.Dataschema1.table_name1` as a 
where   storiesnbr=2 and zip5='21213' and year5=1915
group by 1)

select a.situscensid,b.cnt,count(*) as cnt1  from `project_id1.Dataschema1.table_name1` as a 

inner join b  on a.situscensid=b.situscensid

where  sumnbrbdrm =2 and zip5='68516' and sumnbrbath=4
group by 1,2
order by 3 desc
limit 20;

--transform row to column in bigquery: 
CALL fhoffa.x.pivot(
  'bigquery-public-data.iowa_liquor_sales.sales' # source table
  , 'fh-bigquery.temp.test' # destination table
  , ['date'] # row_ids
  , 'store_number' # pivot_col_name
  , 'sale_dollars' # pivot_col_value
  , 30 # max_columns
  , 'SUM' # aggregation
  , 'LIMIT 10' # optional_limit
);

#use cast to convert to string in cumulative sum, otherwise error message. 
SELECT a.cntycd,cnt_cnty,cast(number_of_stories as string) as stories,cnt,
  SUM(cnt) OVER (PARTITION BY cntycd,cnt_cnty
  ORDER BY number_of_stories ) AS running_sum
from `prodjectid1.schema1.data_OCSD1` as a;
#the partition by columns should stay the same across the cumulative sum. 

--check the column data type without clicking the data: 
SELECT column_name, data_type
FROM `prodjectid1.schema1.INFORMATION_SCHEMA.COLUMNS`
WHERE  table_name="table_name_c";

#use the Bigquery API select statement via unnest for a record row: 
SELECT event_name, event_params
FROM `firebase-public-project.analytics_153293282.events_20180915`
WHERE event_name = "level_complete_quickplay"

#You might get the error "cannot access field key on a value with type ARRAY<STRUCT<...>>" unless you UNNEST the repeated field first:
SELECT event_name, param
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"

SELECT event_name, param.value.int_value AS score
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"

drop table if exists `project_id1.Dataschema1.table_name1`;
create table `project_id1.Dataschema1.table_name1`
as select distinct var1,var2
from `project_id1.Dataschema1.table_name_a` as a 
inner join `project_id1.Dataschema1.table_name_b`  as trans
on  cast(LPAD(cast(a.cntycd as string), 5, '0') as string)=trans.cntycd ;
--pad/replace with leading 0 in a string.  

select id1, cntylegaltxt 
from `prodjectid1.schemat1.michale_property` p 
inner join `prodjectid2.schemat2.sam_sdp_master_xref` x 
on x.datasource_pid1_value = p.cntycd and x.datapid2_value= p.pclid and x.datapid3_value =cast(p.pclseqnbr as string)  
and x.datasource_name='tax'
where REGEXP_CONTAINS(lower(p.cntylegaltxt), r'single family')   
-- REGEXP_CONTAINS(lower(p.cntylegaltxt), r'condo') or REGEXP_CONTAINS(lower(p.cntylegaltxt), r'condominium')
--where LENGTH (cntylegaltxt) > 5000
limit 50;
#REGEXP_MATCH is not supported in standard SQL; use REGEXP_CONTAINS instead

Using different functions in bigquery to convert to datetime: 
select
timestamp_millis(effectivedt) as date_from_millisecond, --use this one most of the time
timestamp_seconds(effectivedt) as date_from_seconds
FROM `c***-poc-****.df.seller` 
where effectivedt is not null LIMIT 10;

Sunday, January 3, 2021

Python Study notes: How HDBSCAN works?

HDBSCAN is a clustering algorithm. It extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) by converting it into a hierarchical clustering algorithm. The main concept of the DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density. The steps of the DBSCAN algorithm:
The algorithm starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the ϵ parameter.

If this point contains MinPts within its ϵ neighborhood, cluster formation starts. Otherwise the point is labeled as noise. The point can later be found within the ϵ neighborhood of a different point and thus be made part of that cluster. The concepts of density-reachable and density-connected points are important here.

If a point is found to be a core point, then the points within its ϵ neighborhood are also part of the cluster. So all points found within the ϵ neighborhood are added, along with their own ϵ neighborhoods if they are also core points.

The above process continues until the density-connected cluster is completely found.

The process restarts with a new point which can be a part of a new cluster or labeled as noise.
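The steps above can be tried directly with scikit-learn's DBSCAN implementation (the data here is synthetic, made up for the demo):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# two dense blobs plus one far-away point that should end up as noise
X, _ = make_blobs(n_samples=60, centers=[(0, 0), (5, 5)],
                  cluster_std=0.3, random_state=0)
X = np.vstack([X, [20.0, 20.0]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(set(db.labels_))  # two cluster labels plus -1 for the noise point
```

Points labeled -1 are exactly the ones DBSCAN declared noise in the steps above.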

The notes are taken from the hdbscan documentation notebook.
#import package and setup the display
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data
%matplotlib inline
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

#load the data:
moons, _ = data.make_moons(n_samples=50, noise=0.05)
blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)
test_data = np.vstack([moons, blobs])
plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)
Time to import the hdbscan package and run the hierarchical clustering algorithm.
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True).fit(test_data)
So now that we have clustered the data -- what actually happened? We can break it out into a series of steps:

Transform the space according to the density/sparsity.
Build the minimum spanning tree of the distance weighted graph.
Construct a cluster hierarchy of connected components.
Condense the cluster hierarchy based on minimum cluster size.
Extract the stable clusters from the condensed tree.

The core distance of a point x for parameter k, denoted core_k(x), is the distance from x to its kth nearest neighbor. In other words, draw the smallest circle around x that covers its k nearest data points; the radius of that circle is the core distance.

Mutual reachability distance: with the two core-distance circles for a pair of points, take the largest of the two core distances and the ordinary distance between the points: d_mreach-k(a,b) = max(core_k(a), core_k(b), d(a,b)).
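Both definitions are easy to state in numpy (a brute-force sketch for intuition; the hdbscan library computes this far more efficiently):

```python
import numpy as np

def core_distance(X, k):
    """Distance from each point to its k-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(D, axis=1)[:, k]   # column 0 is the point itself

def mutual_reachability(X, k):
    """d_mreach-k(a, b) = max(core_k(a), core_k(b), d(a, b))"""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    core = core_distance(X, k)
    return np.maximum(D, np.maximum(core[:, None], core[None, :]))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
M = mutual_reachability(X, k=2)
```

Note the mutual reachability distance is always at least as large as the plain distance, which is what "spreads apart" sparse points before the spanning tree is built.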


#Any point not in a selected cluster is simply a noise point(assigned the label -1)
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat) 
                  if col >= 0 else (0.5, 0.5, 0.5) for col, sat in 
                  zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)                                      
Parameter Selection for HDBSCAN

1. Selecting min_cluster_size: the smallest size grouping that you wish to consider a cluster.

from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits()
data =
projection = TSNE().fit_transform(data)
plt.scatter(*projection.T, **plot_kwds)

#start with a min_cluster_size of 15
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

#Increasing the min_cluster_size to 30 
#reduces the number of clusters, merging some together.
2. Selecting min_samples. The larger the value of min_samples you provide, the more conservative the clustering – more points will be declared as noise, and clusters will be restricted to progressively more dense areas.
clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=1).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
