Tuesday, June 2, 2020

GCP Study notes 13: Architecting with Google Kubernetes Engine: Foundations (Coursera notes)

Architecting with Google Kubernetes Engine Specialization: 4 Courses in this Specialization.

1. Google Cloud Platform Fundamentals: Core Infrastructure
2. Architecting with Google Kubernetes Engine: Foundations
3. Architecting with Google Kubernetes Engine: Workloads
4. Architecting with Google Kubernetes Engine: Production

Architecting with Google Compute Engine Specialization: 5 Courses in this Specialization.

1. Google Cloud Platform Fundamentals: Core Infrastructure
2. Essential Google Cloud Infrastructure: Foundation
3. Essential Google Cloud Infrastructure: Core Services
4. Elastic Google Cloud Infrastructure: Scaling and Automation
5. Reliable Google Cloud Infrastructure: Design and Process

What is Kubernetes?

Kubernetes is an orchestration framework for software containers. Containers are a way to package and run code that's more efficient than virtual machines. Kubernetes provides the tools you need to run containerized applications in production and at scale.

What is Google Kubernetes Engine? Google Kubernetes Engine (GKE) is a managed service for Kubernetes.

Kubernetes Engine: A managed environment for deploying containerized applications
Compute Engine: A managed environment for deploying virtual machines
App Engine: A managed serverless platform for deploying applications
Cloud Functions: A managed serverless platform for deploying event-driven functions

An IAM service account is a special type of Google account that belongs to an application or a virtual machine, instead of to an individual end user.

To create an IAM service account in the console: on the Service account permissions page, specify the role as Project > Editor.
Click + Create Key.
Select JSON as the key type.
A JSON key file is then downloaded; you can upload that file to a VM later.

Task 2. Explore Cloud Shell
Cloud Shell provides the following features and capabilities:
5 GB of persistent disk storage ($HOME dir)
Preinstalled Cloud SDK and other tools
gcloud: for working with Compute Engine, Google Kubernetes Engine (GKE) and many GCP services
gsutil: for working with Cloud Storage
kubectl: for working with GKE and Kubernetes
bq: for working with BigQuery
Language support for Java, Go, Python, Node.js, PHP, and Ruby

Use Cloud Shell to set up the environment variables for this task:

make a storage bucket:
gsutil mb gs://$MY_BUCKET_NAME_2

list of all zones in a given region:
gcloud compute zones list | grep $MY_REGION

##Set this zone to be your default zone
gcloud config set compute/zone $MY_ZONE

#create a vm using command:
gcloud compute instances create $MY_VMNAME \
--machine-type "n1-standard-1" \
--image-project "debian-cloud" \
--image-family "debian-9" \
--subnet "default"

#list of all Vm instances:
gcloud compute instances list

When you browse the VM instances list, the external IP address of the first VM you created is shown as a link. This is because you configured this VM's firewall to allow HTTP traffic.

Use the gcloud command line in Cloud Shell to create a second service account:
gcloud iam service-accounts create test-service-account2 --display-name "test-service-account2"

#To grant the second service account the Project viewer role:
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT --member serviceAccount:test-service-account2@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role roles/viewer

GOOGLE_CLOUD_PROJECT is an environment variable that is automatically populated in Cloud Shell and is set to the project ID of the current context.
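
The member string passed to --member above follows a fixed pattern: the service account name, then the project ID, then the iam.gserviceaccount.com domain. A minimal Python sketch of that pattern is below. The helper name service_account_member and the fallback project ID my-sample-project are made up for illustration; only the GOOGLE_CLOUD_PROJECT variable and the address pattern come from the notes.

```python
import os


def service_account_member(name: str, project_id: str) -> str:
    """Build the IAM member string for a service account in a project."""
    return f"serviceAccount:{name}@{project_id}.iam.gserviceaccount.com"


# In Cloud Shell the variable is populated automatically; elsewhere we fall
# back to a placeholder project ID (an assumption for this sketch).
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-sample-project")
print(service_account_member("test-service-account2", project_id))
```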

Task 3. Work with Cloud Storage in Cloud Shell
#Copy a picture of a cat from a Google-provided Cloud Storage bucket
gsutil cp gs://cloud-training/ak8s/cat.jpg cat.jpg
gsutil cp cat.jpg gs://$MY_BUCKET_NAME_1
gsutil cp gs://$MY_BUCKET_NAME_1/cat.jpg gs://$MY_BUCKET_NAME_2/cat.jpg

#To get the default access list that's been assigned to cat.jpg
gsutil acl get gs://$MY_BUCKET_NAME_1/cat.jpg > acl.txt
cat acl.txt

#Set the access control list for a Cloud Storage object
#To change the object to have private access, execute the following command:
gsutil acl set private gs://$MY_BUCKET_NAME_1/cat.jpg

#To verify the new ACL that's been assigned to cat.jpg
gsutil acl get gs://$MY_BUCKET_NAME_1/cat.jpg > acl-2.txt
cat acl-2.txt

#Authenticate as a service account in Cloud Shell
#to view the current configuration
gcloud config list

#change the authenticated user to the first service account
gcloud auth activate-service-account --key-file credentials.json
gcloud config list
#You should see that the account is now set to the test-service-account service account.

#To verify the list of authorized accounts in Cloud Shell,
gcloud auth list
#the output lists 2 accounts: the student account and the service account (active)

#To verify that the current account (test-service-account) cannot access the cat.jpg file in the first bucket that you created:
gsutil cp gs://$MY_BUCKET_NAME_1/cat.jpg ./cat-copy.jpg
#you should get some error message: AccessDeniedException: 403 HttpError accessing

#but you can copy the one from the 2nd bucket:
gsutil cp gs://$MY_BUCKET_NAME_2/cat.jpg ./cat-copy.jpg

#To switch to the lab account, execute the following command.
gcloud config set account student-02-409a8***d4@qwiklabs.net
#to copy the same cat picture:
gsutil cp gs://$MY_BUCKET_NAME_1/cat.jpg ./copy2-of-cat.jpg

#Make the first Cloud Storage bucket readable by everyone, including unauthenticated users.
gsutil iam ch allUsers:objectViewer gs://$MY_BUCKET_NAME_1
#This is an appropriate setting for hosting public website content in Cloud Storage.
#Even if you switch back to the service account used earlier, you can now copy cat.jpg from bucket 1.

#Open the Cloud Shell code editor
#clone a git repository, orchestrate-with-kubernetes folder appears in the left pane of the Cloud Shell code editor window:
git clone https://github.com/googlecodelabs/orchestrate-with-kubernetes.git

#to create a test directory in cloud shell:
mkdir test

#Add the following text as the last line of the cleanup.sh file:
echo Finished cleanup!
#run the code in cloud shell to check the output:
cd orchestrate-with-kubernetes
cat cleanup.sh

#create a new file with index.html:
#replace the URL with actual link.
<img src="REPLACE_WITH_CAT_URL" />

##ssh to the VM, run the code in new window:
#to install the package for VM to host.
sudo apt-get update
sudo apt-get install nginx

#run the code in cloud shell to copy the html file:
gcloud compute scp index.html first-vm:index.nginx-debian.html --zone=us-central1-c

#If you are prompted whether to add a host key to your list of known hosts, answer y.
#If you are prompted to enter a passphrase, press the Enter key to respond with an empty passphrase. Press the Enter key again when prompted to confirm the empty passphrase.

#In the SSH login window for your VM, copy the HTML file from your home directory to the document root of the nginx Web server:
sudo cp index.nginx-debian.html /var/www/html

Monday, June 1, 2020

Python study notes 7: String matching algorithm

What is Levenshtein distance?
Levenshtein distance between two words is the minimum number of single-character edits/transformations (insertions, deletions or substitutions) required to change one word into the other
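
To make the definition concrete, here is a small pure-Python sketch of the classic dynamic-programming approach. It is not the code inside the python-Levenshtein package, just an illustration of the same quantity; the function name levenshtein is chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory at a time, so the sketch uses space proportional to the length of the second string.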

Levenshtein distance is used mainly to address typos, and I find it pretty much useless if you want to compare two documents for example. That’s where the Cosine similarity comes in. It’s the exact opposite, useless for typo detection, but great for a whole sentence, or document similarity calculation.

pip3 install fuzzywuzzy[speedup]
pip install python-Levenshtein
import Levenshtein
Levenshtein.distance('Levenshtein','Levensthein') #output 2

Levenshtein.distance('This is a foo bar sentence','This sentence is similar to a foo bar sentence') 
#output 20, way too many 

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.

You’ll want to construct a vector space from all the ‘sentences’ you want to calculate similarity for. That vector space will have as many dimensions as there are unique words in all sentences combined.

You’ll need the string module to remove punctuations from the string — ‘sentence’ and ‘sentence.’ are different by default, and you want to avoid that. CountVectorizer will take care of converting strings to numerical vectors, which is also neat. Finally, as this article is written in English, you’ll want to remove the most frequent words which give no meaning — they are called stopwords — words like ‘I’, ‘me’, ‘myself’, etc.

If you don’t remove the stopwords you’ll end up with a higher-dimensional space, and the angle between vectors will be greater — implying less similarity — even though the vectors convey pretty much the same information.

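import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

#needs a one-time nltk.download('stopwords')
stop_words=set(stopwords.words('english'))

def clean_string(text):
  #drop punctuation, lowercase, then drop stopwords
  text=''.join([ch for ch in text if ch not in string.punctuation])
  text=text.lower()
  text=' '.join([word for word in text.split() if word not in stop_words])
  return text

sentences=['This is a foo bar sentence','This sentence is similar to a foo bar sentence',
'This is another string, but it is not quite similar to the previous one',
'This is just another string']

cleaned=list(map(clean_string, sentences))
vectors=CountVectorizer().fit_transform(cleaned).toarray()
cosine_similarity(vectors) #pairwise similarity matrix, one row per sentence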
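
To see what the word-count vectors and the cosine measure amount to, here is a sketch that uses only the standard library. It is a simplification: it does no stopword or punctuation removal, and the function name cosine is chosen for this example.

```python
import math
from collections import Counter


def cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over shared words; a word missing from one text counts as 0.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # cosine is undefined for a zero vector; return 0 here
    return dot / (norm_a * norm_b)


print(round(cosine("foo bar sentence", "foo bar sentence"), 3))  # 1.0
print(round(cosine("foo bar", "baz qux"), 3))                    # 0.0
```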




Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences in a simple-to-use package. It offers 4 commonly used metrics:
ratio, partial_ratio, token_sort_ratio, token_set_ratio.
ratio: a 0-100 similarity score for the two full strings, based on Levenshtein distance.
partial_ratio: scores the shorter string against each portion of the longer string, then takes the maximum over those sub-strings.
token_sort_ratio: ignores word order.
token_set_ratio: ignores duplicated words.
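
The ideas behind the first three metrics can be sketched with the standard library's difflib. This is an approximation for illustration, not fuzzywuzzy's own code: it skips the library's punctuation stripping and uses a brute-force sliding window for the partial score. The function names below are chosen for this sketch.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of the two full strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length slice of the longer."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


def token_sort(a: str, b: str) -> int:
    """Sort the lowercased words first, so word order no longer matters."""
    def sort_words(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return simple_ratio(sort_words(a), sort_words(b))


print(simple_ratio("this is a test", "this is a test with diff!"))  # 72
print(partial("this is a test", "this is a test with diff!"))       # 100
print(token_sort("is this a test", "this is a test"))               # 100
```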

from fuzzywuzzy import fuzz

# process is used to compare a string to MULTIPLE other strings
from fuzzywuzzy import process
import pandas as pd
df = pd.read_csv('room_type.csv')

#ratio , compares the entire string similarity
fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')
#output: 62, means 62% similar

#partial_ratio , compares partial string similarity.
fuzz.ratio("this is a test", "this is a test with diff!")
#output: 72, by using deletion.
fuzz.partial_ratio("this is a test", "this is a test with diff!")
#output: 100, exactly match with the portion of the longer string.

fuzz.ratio("this is a test", "this is a test with diff!")
#output: 72
fuzz.token_sort_ratio("this is a diff test", "this is a test with diff!")
#output: 88, should be higher than ratio.

fuzz.token_sort_ratio("this is a test with diff!","this is a diff test with diff")
#output: 91
#token_sort_ratio: ignore the order of words
fuzz.token_set_ratio("this is a test with diff!","this is a diff test with diff")
#output: 100, it also ignores the duplicated words, so the score is higher

choices = ['fuzzy fuzzy was a bear', 'is this a test', 'THIS IS A TEST!!']
process.extract("this is a test", choices, scorer=fuzz.token_sort_ratio)
[('is this a test', 100),
 ('THIS IS A TEST!!', 100),
 ('fuzzy fuzzy was a bear', 28)]

Jaccard Index/Similarity: the size of the intersection of two sets divided by the size of their union.
Jaccard Distance: 1 - Jaccard Index
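
A minimal sketch of both quantities, treating each string as a set of lowercased words. Splitting on whitespace and returning 1.0 for two empty inputs are choices made for this example, not part of the definition.

```python
def jaccard_index(text_a: str, text_b: str) -> float:
    """Size of the word intersection divided by the size of the word union."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(set_a & set_b) / len(union)


def jaccard_distance(text_a: str, text_b: str) -> float:
    return 1 - jaccard_index(text_a, text_b)


# 3 shared words (this, is, a) out of 5 distinct words overall
print(jaccard_index("this is a test", "this is a quiz"))  # 0.6
```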

Hamming Distance: for two 1-D arrays u and v of equal length, scipy returns the proportion of positions at which they differ.
>>> from scipy.spatial import distance
>>> distance.hamming([1, 0, 0], [0, 1, 0])
>>> distance.hamming([1, 0, 0], [1, 1, 0])
>>> distance.hamming([1, 0, 0], [2, 0, 0])
>>> distance.hamming([1, 0, 0], [3, 0, 0])
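
For reference, the same proportion can be computed by hand. The sketch below mirrors the unweighted behaviour of the scipy call above; the function name hamming and the choice to return 0.0 for empty input are assumptions of this example.

```python
def hamming(u, v) -> float:
    """Proportion of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    if len(u) == 0:
        return 0.0  # choice made for this sketch
    return sum(x != y for x, y in zip(u, v)) / len(u)


print(hamming([1, 0, 0], [0, 1, 0]))  # 2 of 3 positions differ
print(hamming([1, 0, 0], [1, 1, 0]))  # 1 of 3 positions differ
print(hamming("karolin", "kathrin"))  # works on strings too: 3 of 7 differ
```

Note that the size of a difference does not matter: [1, 0, 0] against [2, 0, 0] and against [3, 0, 0] both differ in exactly one position.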

Solution #1: Python builtin
use SequenceMatcher from difflib

pros: native Python library, no extra package needed.
cons: limited; there are many other good string-similarity algorithms that it does not provide.

example :
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()

Solution #2: jellyfish library
It's a very good library with good coverage and few issues. It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance

pros: easy to use, gamut of supported algorithms, tested.
cons: not native library.


>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')

Saturday, May 30, 2020

GCP Study notes 12: example of using GCP Deployment Manager and Stackdriver

Stackdriver is GCP's tool for monitoring, logging and diagnostics. Stackdriver gives you access to many different kinds of signals from your infrastructure platforms, virtual machines, containers, middleware and application tier, logs, metrics and traces. It gives you insight into your application's health, performance and availability. So if issues occur, you can fix them faster. Here are the core components of Stackdriver: Monitoring, Logging, Trace, Error Reporting and Debugging.

Example of using GCP Deployment Manager and Stackdriver

• Confirm that the following APIs are enabled via the APIs & Services module:
Cloud Deployment Manager v2 API
Cloud Runtime Configuration API
Cloud Monitoring API

• Task: Create a Deployment Manager deployment
In the GCP console, on the top right toolbar, click the Activate Cloud Shell button. Click Continue. For your convenience, place the zone that Qwiklabs assigned you to into an environment variable called MY_ZONE. At the Cloud Shell prompt, type this partial command:
export MY_ZONE=

followed by the zone that Qwiklabs assigned you to. Your complete command will look like this:
export MY_ZONE=us-central1-a

At the Cloud Shell prompt, download an editable Deployment Manager template:
gsutil cp gs://cloud-training/gcpfcoreinfra/mydeploy.yaml mydeploy.yaml

Insert your Google Cloud Platform project ID into the file in place of the string PROJECT_ID using this command:
sed -i -e 's/PROJECT_ID/'$DEVSHELL_PROJECT_ID/ mydeploy.yaml

Insert your assigned Google Cloud Platform zone into the file in place of the string ZONE using this command:
sed -i -e 's/ZONE/'$MY_ZONE/ mydeploy.yaml

View the mydeploy.yaml file, with your modifications, with this command:
cat mydeploy.yaml

The file will look something like this:
resources:
- name: my-vm
  type: compute.v1.instance
  properties:
    zone: us-west1-b
    machineType: zones/us-west1-b/machineTypes/n1-standard-1
    metadata:
      items:
      - key: startup-script
        value: "apt-get update"
    disks:
    - deviceName: boot
      type: PERSISTENT
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: https://www.googleapis.com/compute/v1/projects/debian-cloud/global/images/debian-9-stretch-v20180806
    networkInterfaces:
    - network: https://www.googleapis.com/compute/v1/projects/qwiklabs-gcp-02-asdg/global/networks/default
      accessConfigs:
      - name: External NAT
        type: ONE_TO_ONE_NAT

Build a deployment from the template:
gcloud deployment-manager deployments create my-first-depl --config mydeploy.yaml

When the deployment operation is complete, the gcloud command displays a list of the resources named in the template and their current state.

Confirm that the deployment was successful. In the GCP Console, on the Navigation menu, click Compute Engine > VM instances. You will see that a VM instance called my-vm has been created, as specified by the template.

Click on the VM instance's name to open its VM instance details screen.

Scroll down to the Custom metadata section. Confirm that the startup script you specified in your Deployment Manager template has been installed.

• Task: Update a Deployment Manager deployment
Return to your Cloud Shell prompt. Launch the nano text editor to edit the mydeploy.yaml file:
nano mydeploy.yaml

Find the line that sets the value of the startup script, value: "apt-get update", and edit it so that it looks like this:
value: "apt-get update; apt-get install nginx-light -y"

Do not disturb the spaces at the beginning of the line. The YAML templating language relies on indented lines as part of its syntax. As you edit the file, be sure that the v in the word value in this new line is immediately below the k in the word key on the line above it.

Press Ctrl+O and then press Enter to save your edited file.

Press Ctrl+X to exit the nano text editor.

Return to your Cloud Shell prompt. Enter this command to cause Deployment Manager to update your deployment to install the new startup script:
gcloud deployment-manager deployments update my-first-depl --config mydeploy.yaml

Wait for the gcloud command to display a message confirming that the update operation was completed successfully. In the GCP console, on the Navigation menu, click Compute Engine > VM instances. Click on the my-vm VM instance's name to open its VM instance details pane. Scroll down to the Custom metadata section.

• Task: View the Load on a VM using Cloud Monitoring
In the GCP Console, on the Navigation menu, click Compute Engine > VM instances.

To open a command prompt on the my-vm instance, click SSH in its row in the VM instances list.
In the ssh session on my-vm, execute this command to create a CPU load:
dd if=/dev/urandom | gzip -9 >> /dev/null &

This Linux pipeline forces the CPU to work on compressing a continuous stream of random data.
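
For comparison, here is a rough Python analogue of that pipeline: it compresses random bytes at the highest compression level and discards the result. Unlike the shell command it stops after a time limit. The function name burn_cpu and its parameters are made up for this sketch.

```python
import os
import time
import zlib


def burn_cpu(seconds: float = 1.0, chunk_size: int = 1 << 20) -> int:
    """Compress random data until the deadline passes; return bytes processed."""
    deadline = time.monotonic() + seconds
    processed = 0
    while time.monotonic() < deadline:
        # Level 9 is the slowest, most CPU-hungry setting, like gzip -9.
        zlib.compress(os.urandom(chunk_size), 9)
        processed += chunk_size
    return processed


print(burn_cpu(seconds=0.2), "bytes compressed and discarded")
```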

Create a Monitoring workspace
You will now set up a Monitoring workspace that's tied to your Qwiklabs GCP Project. The following steps create a new account that has a free trial of Monitoring.
In the Google Cloud Platform Console, click on Navigation menu > Monitoring.

Wait for your workspace to be provisioned. When the Monitoring dashboard opens, your workspace is ready.
Click on Settings option from the left panel and confirm that the GCP project which Qwiklabs created for you is shown under the GCP Projects section.

Under the Settings tab menu, click Agent. Using your VM's open SSH window and the code shown on the Agents page, install both the Monitoring and Logging agents on your project's VM.

The Monitoring agent is a collectd-based daemon that gathers system and application metrics from virtual machine instances and sends them to Monitoring. By default, the Monitoring agent collects disk, CPU, network, and process metrics. You can also configure the Monitoring agent to collect metrics from third-party applications.

Monitoring agent install script
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh

The Logging agent streams logs from your VM instances and from selected third-party software packages to Logging. It is a best practice to run the Logging agent on all your VM instances.

Logging agent install script
curl -sSO https://dl.google.com/cloudagents/install-logging-agent.sh
sudo bash install-logging-agent.sh

Sunday, May 17, 2020

GCP Study Notes 11 : Example of deploying App Engine step by step

Lab: GCP Fundamentals: Getting Started with App Engine

Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Google Cloud Shell provides command-line access to your GCP resources.

#Updates are available for some Cloud SDK components
gcloud components update

#You can list the active account name with this command:
gcloud auth list

Credentialed accounts:
- google1623327_student@qwiklabs.net

#You can list the project ID with this command:
gcloud config list project
Example output:

project = qwiklabs-gcp-44776a13dea667a6

#list all service-accounts available:
gcloud iam service-accounts list
Dataflow service account dataflow-service-account@brcm-edg-dsa-poc-edc.iam.gserviceaccount.com False
Dataproc service account dataproc-service-account@brcm-edg-dsa-poc-edc.iam.gserviceaccount.com False
Service Account used for Kubeflow admin actions. kubeflow-deployment-admin@cbrcm-edg-dsa-poc-edc.iam.gserviceaccount.com False
bigquery-sa bigquery-sa@brcm-edg-dsa-poc-edc.iam.gserviceaccount.com False

#list all users/service-accounts who have been granted any IAM roles on a specified project:
gcloud projects get-iam-policy MY_PROJECT
sample output (default YAML format; add --format=json for JSON):
bindings:
- members:
  - serviceAccount:12345678-compute@developer.gserviceaccount.com
  - user:alice@foobar.com
  role: roles/editor
- members:
  - user:you@yourdomain.com
  - user:someoneelse@yourdomain.com
  role: roles/owner
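
When the policy is exported as JSON (the --format=json flag), it is easy to post-process. The sketch below groups members by role; it assumes the policy has a top-level "bindings" list of objects with "role" and "members" keys, and the sample string is the example above rewritten as JSON. The function name members_by_role is chosen here.

```python
import json


def members_by_role(policy_json: str) -> dict:
    """Map each role in an IAM policy document to its list of members."""
    policy = json.loads(policy_json)
    result = {}
    for binding in policy.get("bindings", []):
        result.setdefault(binding["role"], []).extend(binding.get("members", []))
    return result


sample = """
{"bindings": [
  {"members": ["serviceAccount:12345678-compute@developer.gserviceaccount.com",
               "user:alice@foobar.com"],
   "role": "roles/editor"},
  {"members": ["user:you@yourdomain.com", "user:someoneelse@yourdomain.com"],
   "role": "roles/owner"}
]}
"""

for role, members in members_by_role(sample).items():
    print(role, "->", ", ".join(members))
```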

#To get the organization ID
gcloud organizations list

#list all non-service accounts from the entire GCP organization:
gcloud organizations get-iam-policy ORGANIZATION_ID | grep user\: | sort | uniq

Task 1: Install the Cloud SDK for App Engine

Run the following command to install the gcloud component that includes the App Engine extension for Python 3.7:
gcloud components install app-engine-python
#if you get a permission error, use the following:
sudo apt-get install google-cloud-sdk-app-engine-python

Initialize your App Engine app with your project and choose its region:
gcloud app create --project=$DEVSHELL_PROJECT_ID
When prompted, select the region where you want your App Engine application located.

Clone the source code repository for a sample application in the hello_world directory:
git clone https://github.com/GoogleCloudPlatform/python-docs-samples

Navigate to the source directory:
cd python-docs-samples/appengine/standard_python37/hello_world

A good video walkthrough of this lab is in the Coursera course:
Week 1: Demonstration: Getting Started with App Engine

Task 2: Run Hello World application locally

In this task, you run the Hello World application in a local, virtual environment in Cloud Shell. Ensure that you are at the Cloud Shell command prompt. Execute the following command to download and update the packages list.

sudo apt-get update

Set up a virtual environment in which you will run your application.
Python virtual environments are used to isolate package installations from the system.

sudo apt-get install virtualenv

virtualenv -p python3 venv

If prompted [Y/n], press Y and then Enter.
Activate the virtual environment.
source venv/bin/activate

#use "ls -l" instead of "list -a" to list all the files
#use "cat requirements.txt" to view the file first.
Navigate to your project directory and install dependencies.
pip install -r requirements.txt

Run the application:
python main.py

In Cloud Shell, click Web preview > Preview on port 8080 to preview the application.

Task 3: Deploy and run Hello World on App Engine

To deploy your application to the App Engine Standard environment, navigate to the source directory:
cd ~/python-docs-samples/appengine/standard_python37/hello_world

Deploy your Hello World application.
gcloud app deploy

This app deploy command uses the app.yaml file to identify project configuration.
Launch your browser to view the app at http://YOUR_PROJECT_ID.appspot.com
gcloud app browse

GCP provides Deployment Manager to set up your environment. It's an Infrastructure Management Service that automates the creation and management of your Google Cloud Platform resources for you. To use it, you create a template file using either the YAML markup language or Python that describes what you want the components of your environment to look like. Then, you give the template to Deployment Manager, which figures out and does the actions needed to create the environment your template describes. If you need to change your environment, edit your template and then tell Deployment Manager to update the environment to match the change. Here's a tip: you can store and version control your Deployment Manager templates in Cloud Source repositories.

GCP Cloud Functions is roughly equivalent to AWS Lambda

AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda executes your code only when needed and scales automatically, from a few requests per day to thousands per second. You pay only for the compute time you consume; there is no charge when your code is not running. You can use AWS Lambda to run your code in response to events, such as changes to data in an Amazon S3 bucket or an Amazon DynamoDB table, or in response to HTTP requests using Amazon API Gateway.

What is the advantage of putting event-driven components of your application into GCP Cloud Functions? Cloud Functions handles scaling these components seamlessly. Your code executes whenever an event triggers it, no matter whether it happens rarely or many times per second. That means you don't have to provision compute resources to handle these operations.

Sunday, May 10, 2020

GCP Study Notes 10 : Docker, Containers, Kubernetes, and Kubernetes Engine

An application and its dependencies are called an image. A container is simply a running instance of an image. You need software to build container images and to run them. Docker is one tool that does both. Docker is an open source technology that allows you to create and run applications in containers. But it doesn't offer a way to orchestrate those applications at scale like Kubernetes does.

Containers are not an intrinsic primitive feature of Linux. Instead, their power to isolate workloads is derived from the composition of several technologies:

One foundation is the Linux process. Each Linux process has its own virtual memory address space separate from all others. And Linux processes are rapidly created, and destroyed.

Containers use Linux namespaces to control what an application can see: process ID numbers, directory trees, IP addresses and more. By the way, Linux namespaces are not the same thing as Kubernetes namespaces, which you'll learn more about later on in this course.

Containers use Linux cgroups to control what an application can use: its maximum consumption of CPU time, memory, IO bandwidth, and other resources.

Finally, containers use union file systems to efficiently encapsulate applications, and their dependencies into a set of clean minimal layers.

A container image is structured in layers. The tool you use to build the image reads instructions from a file called the container manifest. In the case of a Docker-formatted container image, that's called a Dockerfile. Each instruction in the Dockerfile specifies a layer inside the container image. Each layer is read-only. When a container runs from this image, it will also have a writable ephemeral topmost layer.

We've already discussed Compute Engine, which is GCP's Infrastructure as a Service offering, which lets you run virtual machines in the cloud and gives you persistent storage and networking for them, and App Engine, which is one of GCP's Platform as a Service offerings. Now I'm going to introduce you to a service called Kubernetes Engine. It's like an Infrastructure as a Service offering in that it saves you infrastructure chores. It's also like a Platform as a Service offering, in that it was built with the needs of developers in mind.

First, I'll tell you about a way to package software called Containers. I'll describe why Containers are useful, and how to manage them in Kubernetes Engine. Let's begin by remembering that Infrastructure as a Service offerings let you share compute resources with others by virtualizing the hardware. Each Virtual Machine has its own instance of an operating system, your choice, and you can build and run applications on it with access to memory, file systems, networking interfaces, and the other attributes that physical computers also have. But flexibility comes with a cost. In an environment like this, the smallest unit of compute is a Virtual Machine together with its application.

The guest OS, that is, the operating system, may be large, even gigabytes in size. It can take minutes to boot up. Often it's worth it. Virtual Machines are highly configurable, and you can install and run your tools of choice. So you can configure the underlying system resources such as disks and networking, and you can install your own web server, database, or middleware. But suppose your application is a big success. As demand for it increases, you have to scale out in units of an entire Virtual Machine with a guest operating system for each. That can mean your resource consumption grows faster than you like.

Now, let's make a contrast with a Platform as a Service environment like App Engine. From the perspective of someone deploying on App Engine, it feels very different. Instead of getting a blank Virtual Machine, you get access to a family of services that applications need. So all you do is write your code in self-contained workloads that use these services and include any dependent libraries. As demand for your application increases, the platform scales your applications seamlessly and independently by workload and infrastructure.

This scales rapidly, but you give up control of the underlying server architecture. That's where Containers come in. The idea of a Container is to give you the independent scalability of workloads like you get in a PaaS environment, and an abstraction layer over the operating system and hardware, like you get in an Infrastructure as a Service environment. What you get is an invisible box around your code and its dependencies, with limited access to its own partition of the file system and hardware.

Remember that in Windows, Linux, and other operating systems, a process is an instance of a running program. A Container starts as quickly as a new process. Compare that to how long it takes to boot up an entirely new instance of an operating system. All you need on each host is an operating system that supports Containers and a Container run-time. In essence, you're virtualizing the operating system rather than the hardware.

The environment scales like PaaS but gives you nearly the same flexibility as Infrastructure as a Service. The container abstraction makes your code very portable. You can treat the operating system and hardware as a black box. So you can move your code from development, to staging, to production, or from your laptop to the Cloud without changing or rebuilding anything. If you want to scale, for example, a web server, you can do so in seconds, and deploy dozens or hundreds of them depending on the size of your workload on a single host.

Well, that's a simple example. Let's consider a more complicated case. You'll likely want to build your applications using lots of Containers, each performing their own function, say using the micro-services pattern. The units of code running in these Containers can communicate with each other over a network fabric. If you build this way, you can make applications modular. They deploy easily and scale independently across a group of hosts.
The host can scale up and down, and start and stop Containers as demand for your application changes, or even as hosts fail and are replaced. A tool that helps you do this well is Kubernetes. Kubernetes makes it easy to orchestrate many Containers on many hosts, scale them, roll out new versions of them, and even roll back to the old version if things go wrong.

First, I'll show you how you build and run containers. The most common format for Container images is the one defined by the open source tool Docker. In my example, I'll use Docker to bundle an application and its dependencies into a Container. You could use a different tool. For example, Google Cloud offers Cloud Build, a managed service for building Containers. It's up to you.

Here is an example of some code you may have written. It's a Python web application, and it uses the very popular Flask framework. Whenever a web browser talks to it by asking for its top-most document, it replies "hello world". Or if the browser instead appends /version to the request, the application replies with its version. Great. So how do you deploy this application?

It needs a specific version of Python and a specific version of Flask, which we control using Python's requirements.txt file, together with its other dependencies too. So you use a Dockerfile to specify how your code gets packaged into a container. For example, Ubuntu is a popular distribution of Linux. Let's start there. You can install Python the same way you would on your development environment. Of course, now that it's in a file, it's repeatable.
What's inside a Dockerfile?

From the Dockerfile above, there are a few commands inside.
The FROM statement starts out by creating a base layer pulled from a public repository. This one happens to be the Ubuntu Linux runtime environment of a specific version. The COPY command adds a new layer, containing some files copied in from your build tool's current directory. The RUN command builds your application using the make command, and puts the result of the build into a third layer. And finally, the last layer specifies what command to run within the container when it's launched. Each layer is only a set of differences from the layer before it. When you write a Dockerfile, you should order the layers from least likely to change to most likely to change.

All changes made to the running container, such as writing new files, modifying existing files, and deleting files, are written to this thin writable container layer. This layer is ephemeral: when the container is deleted, the contents of this writable layer are lost forever. The underlying container image itself remains unchanged. This fact about containers has an implication for your application design. Whenever you want to store data permanently, you must do so somewhere other than a running container image.

Because each container has its own writable container layer, and all changes are stored in this layer, multiple containers can share access to the same underlying image and yet have their own data state. The diagram here shows multiple containers sharing the same Ubuntu 15.04 image. Because each layer is only a set of differences from the layer before it, you get smaller images. For example, your base application image may be 200 megabytes, but the difference of the next point release might only be 200 kilobytes. When you build a container, instead of copying the whole image, it creates a layer with just the differences. When you run a container, the container runtime pulls down the layers it needs. When you update, you only need to copy the difference. This is much faster than running a new virtual machine.
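The sharing described above can be modeled in a few lines. This is only a toy sketch: real container runtimes use union filesystems, not Python dictionaries, and the file names and the `start_container` helper are made up for illustration.

```python
from collections import ChainMap

# Read-only image layers, oldest first; each holds only what changed.
base_layer = {"/etc/os-release": "Ubuntu", "/usr/bin/python3": "python binary"}
copy_layer = {"/app/main.py": "print('hello')"}
image_layers = [base_layer, copy_layer]


def start_container(layers):
    """Return a filesystem view: a fresh writable layer over shared image layers."""
    writable = {}
    # ChainMap looks keys up left to right, so the newest layer comes first,
    # and all writes go to the first mapping (the writable layer).
    return ChainMap(writable, *reversed(layers))


c1 = start_container(image_layers)
c2 = start_container(image_layers)

c1["/tmp/scratch.txt"] = "only in container 1"  # lands in c1's writable layer
c1["/app/main.py"] = "print('patched')"         # shadows the image file for c1 only

print(c1["/app/main.py"])        # c1 sees its own change
print(c2["/app/main.py"])        # c2 still sees the image's version
print("/tmp/scratch.txt" in c2)  # c1's new file is invisible to c2

# Deleting the container discards its writable layer; the image is untouched.
c1.maps[0].clear()
print(c1["/app/main.py"])
```

Both containers read the same image layers, so starting the second one costs only an empty writable layer.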

It's very common to use publicly available open source container images as a base for your own images or for unmodified use. For example, you've already seen the Ubuntu container image, which provides an Ubuntu Linux environment inside a container. Alpine is a popular Linux environment in a container, noted for being very, very small. The NGINX web server is frequently used in its container packaging.

Google maintains a container registry, gcr.io. This registry contains many public open source images. And Google Cloud customers also use it to store their own private images in a way that integrates well with Cloud IAM.

To generate a build with Cloud Build, you define a series of steps. For example, you can configure build steps to fetch dependencies, compile source code, run integration tests, or use tools such as Docker, Gradle, and Maven. Each build step in Cloud Build runs in a Docker container.

Then Cloud Build can deliver your newly built images to various execution environments. Not only GKE but also App Engine, and Cloud Functions.

Let's copy in the requirements.txt file we created earlier, and use it to install our applications dependencies. We'll also copy in the files that make up our application and tell the environment that launches this Container how to run it. Then I use the docker build command to build the Container. This builds the Container and stores it on the local system as a runnable image. Then I can use the docker run command to run the image. In a real-world situation, you'd probably upload the image to a Container Registry service, such as the Google Container Registry and share or download it from there. Great, we packaged an application, but building a reliable, scalable, distributed system takes a lot more. How about application configuration, service discovery, managing updates, and monitoring?

Kubernetes is an open-source orchestrator for containers so you can better manage and scale your applications. Kubernetes offers an API that lets people, that is authorized people, not just anybody, control its operation through several utilities.

In Kubernetes, a node represents a computing instance. In Google Cloud, nodes are virtual machines running in Compute Engine.

#code to create a Kubernetes cluster
gcloud container clusters create k1

Whenever Kubernetes deploys a container or a set of related containers, it does so inside an abstraction called a pod. A pod is the smallest deployable unit in Kubernetes. Think of a pod as if it were a running process on your cluster. It could be one component of your application or even an entire application.

It's common to have only one container per pod. But if you have multiple containers with a hard dependency, you can package them into a single pod. They'll automatically share networking and they can have disk storage volumes in common. Each pod in Kubernetes gets a unique IP address and set of ports for your containers. Because containers inside a pod can communicate with each other using the localhost network interface, they don't know or care which nodes they're deployed on.

Running the kubectl run command starts a deployment with a container running in a pod. In this example, the container running inside the pod is an image of the popular nginx open source web server. The kubectl command is smart enough to fetch an image of nginx of the version we request from a container registry. So what is a deployment? A deployment represents a group of replicas of the same pod. It keeps your pods running even if a node on which some of them run fails. You can use a deployment to contain a component of your application or even the entire application.

To see the running nginx pods, run the command kubectl get pods. By default, pods in a deployment are only accessible inside your cluster. But what if you want people on the Internet to be able to access the content in your nginx web server? To make the pods in your deployment publicly available, you can connect a load balancer to it by running the kubectl expose command. Kubernetes then creates a service with a fixed IP address for your pods. A service is the fundamental way Kubernetes represents load balancing.

To be specific, you requested Kubernetes to attach an external load balancer with a public IP address to your service so that others outside the cluster can access it. In GKE, this kind of load balancer is created as a network load balancer. This is one of the managed load balancing services that Compute Engine makes available to virtual machines. GKE makes it easy to use it with containers. Any client that hits that IP address will be routed to a pod behind the service. In this case, there is only one pod, your simple nginx pod.

So what exactly is a service? A service groups a set of pods together and provides a stable endpoint for them. In our case, a public IP address managed by a network load balancer, although there are other choices. But why do you need a service? Why not just use pods' IP addresses directly? Suppose instead your application consisted of a front end and a back end. Couldn't the front end just access the back end using those pods' internal IP addresses without the need for a service? Yes, but it would be a management problem. As deployments create and destroy pods, pods get their own IP addresses, but those addresses don't remain stable over time. Services provide that stable endpoint you need.
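Why a stable endpoint matters can be shown with a toy model. This is a sketch under stated assumptions, not how Kubernetes implements services: the `Service` class, the IP addresses, and the round-robin choice of backend are all hypothetical.

```python
import itertools


class Service:
    """Toy model: one stable address in front of a changing set of pod IPs."""

    def __init__(self, stable_ip):
        self.stable_ip = stable_ip
        self.endpoints = []
        self._counter = itertools.count()

    def set_endpoints(self, pod_ips):
        """Called whenever pods are created or destroyed."""
        self.endpoints = list(pod_ips)

    def route(self):
        """Pick a backend pod for the next request (round-robin, an assumption)."""
        if not self.endpoints:
            raise RuntimeError("no ready pods behind the service")
        return self.endpoints[next(self._counter) % len(self.endpoints)]


backend = Service("203.0.113.10")      # hypothetical fixed address
backend.set_endpoints(["10.0.0.4"])    # one pod today
first = backend.route()

# The deployment replaces the pod and scales out: new pods, new IPs.
backend.set_endpoints(["10.0.1.7", "10.0.2.9"])
second = backend.route()

print(backend.stable_ip, first, second)
```

The front end only ever needs `stable_ip`; the pod addresses behind it can change without the caller noticing.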

What if you need more power? To scale a deployment, run the kubectl scale command. Now our deployment has 3 nginx web servers, but they're all behind the service and they're all available through one fixed IP address. You could also use auto scaling with all kinds of useful parameters.
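As an illustration of what autoscaling computes, the sketch below applies the ratio that the Kubernetes Horizontal Pod Autoscaler documentation describes: desired replicas are the current replicas scaled by current metric over target metric, rounded up. The function name, the clamping bounds, and the example numbers are assumptions, and the real autoscaler adds tolerances and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core autoscaling ratio, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 3 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(3, 90, 60))
# 4 pods averaging 30% CPU against a 60% target: scale in.
print(desired_replicas(4, 30, 60))
# A large spike is capped at max_replicas.
print(desired_replicas(3, 600, 60))
```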

Instead of issuing commands, you provide a configuration file that tells Kubernetes what you want your desired state to look like and Kubernetes figures out how to do it. These configuration files then become your management tools. To make a change, edit the file and then present the changed version to Kubernetes. The command on the slide is one way we could get a starting point for one of these files based on the work we've already done.
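The declarative idea can be sketched as a comparison between desired and actual state. This is a simplified illustration, not Kubernetes code: the `reconcile` function and the dictionaries mapping a deployment name to a replica count are hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions needed to move `actual` to `desired`.

    Both arguments map a deployment name to its replica count.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that is no longer declared gets stopped entirely.
    for name in sorted(set(actual) - set(desired)):
        actions.append(("stop", name, actual[name]))
    return actions


# The edited file asks for 3 nginx replicas; the cluster currently runs 1.
print(reconcile({"nginx": 3}, {"nginx": 1}))
# Nothing to do once actual matches desired.
print(reconcile({"nginx": 3}, {"nginx": 3}))
```

You state the end result and the system works out the steps, which is why the configuration file can serve as the management tool.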

Lab: building and running containerized applications, orchestrating and scaling them on a cluster. Finally, deploying them using rollouts.
1. need to enable Kubernetes Engine API & Container Registry API.
#setup the environment variable called MY_ZONE:  
export MY_ZONE=us-central1-a

#Start a Kubernetes cluster managed by Kubernetes Engine, 
#Name the cluster webfrontend and configure it to run 2 nodes:
gcloud container clusters create webfrontend --zone $MY_ZONE --num-nodes 2
#those 2 nodes are Compute Engine VMs; you can see them on the VM instances page.

#check the version of kubernetes: 
kubectl version

#launch a single instance of nginx container. (Nginx is a popular web server.)
kubectl create deploy nginx --image=nginx:1.17.10
#create a deployment consisting of a single pod containing the nginx container. 

#View the pod running the nginx container:
kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6cc5778b4d-qdch4   1/1     Running   0          75s

#Expose the nginx container to the Internet:
kubectl expose deployment nginx --port 80 --type LoadBalancer
#Kubernetes created a service and an external load balancer with a public IP address attached to it. 
#The IP address remains the same for the life of the service.

#View the new service:
kubectl get services
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP             443/TCP        18m
nginx        LoadBalancer   80:31318/TCP   51s

#Scale up the number of pods to 3 running on your service:
kubectl scale deployment nginx --replicas 3

kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6cc5778b4d-bp9fr   1/1     Running   0          68s
nginx-6cc5778b4d-qdch4   1/1     Running   0          7m16s
nginx-6cc5778b4d-st48z   1/1     Running   0          68s

#double check the external IP is not changed: 
kubectl get services
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP             443/TCP        23m
nginx        LoadBalancer   80:31318/TCP   4m54s
We've discussed two GCP products that provide the compute infrastructure for applications: Compute Engine and Kubernetes Engine.
The App Engine platform manages the hardware and networking infrastructure required to run your code. To deploy an application on App Engine, you just hand App Engine your code and the App Engine service takes care of the rest. App Engine provides you with built-in services that many web applications need: NoSQL databases, in-memory caching, load balancing, health checks, logging, and a way to authenticate users. App Engine will scale your application automatically in response to the amount of traffic it receives, so you only pay for the resources you use. There are no servers for you to provision or maintain. That's why App Engine is especially suited for applications where the workload is highly variable or unpredictable, like web applications and mobile backends. App Engine offers two environments: standard and flexible.

Google App Engine Standard Environment: Of the two App Engine environments, Standard is the simpler. It offers a simpler deployment experience than the Flexible environment and fine-grained auto-scaling. It also offers a free daily usage quota for the use of some services. What's distinctive about the Standard Environment, though, is that low-utilization applications might be able to run at no charge.

Google provides App Engine software development kits in several languages, so that you can test your application locally before you upload it to the real App Engine service. The SDKs also provide simple commands for deployment. Now, you may be wondering what does my code actually run on? I mean what exactly is the executable binary? App Engine's term for this kind of binary is the runtime.

In App Engine Standard Environment, you use a runtime provided by Google. We'll see your choices shortly. App Engine Standard Environment provides runtimes for specific versions of Java, Python, PHP and Go. The runtimes also include libraries that support App Engine APIs. And for many applications, the Standard Environment runtimes and libraries may be all you need. If you want to code in another language, Standard Environment is not right for you.

You'll want to consider the Flexible Environment. The Standard Environment also enforces restrictions on your code by making it run in a so-called "Sandbox." That's a software construct that's independent of the hardware, operating system, or physical location of the server it runs on. The Sandbox is one of the reasons why App Engine Standard Environment can scale and manage your application in a very fine-grained way.

Like all Sandboxes, it imposes some constraints. For example, your application can't write to the local file system. It'll have to write to a database service instead if it needs to make data persistent. Also, all the requests your application receives have a 60-second timeout, and you can't install arbitrary third-party software. If these constraints don't work for you, that would be a reason to choose the Flexible Environment.

Instead of the sandbox, App Engine flexible environment lets you specify the container your application runs in. Yes, containers. Your application runs inside Docker containers on Google Compute Engine virtual machines (VMs). App Engine manages these Compute Engine machines for you. They're health checked and healed as necessary, you get to choose which geographical region they run in, and critical backward-compatible updates to their operating systems are automatically applied. All this so that you can just focus on your code. App Engine flexible environment apps use standard runtimes and can access App Engine services such as Datastore, Memcache, task queues, and so on.

Notice that Standard environment starts up instances of your application faster, but that you get less access to the infrastructure in which your application runs. For example, Flexible environment lets you SSH into the virtual machines on which your application runs. It lets you use local disk for scratch space, it lets you install third-party software, and it lets your application make calls to the network without going through App Engine. On the other hand, Standard environment's billing can drop to zero for a completely idle application.

App Engine standard environment is for people who want the service to take maximum control of their application's deployment and scaling. Kubernetes Engine gives the application owner the full flexibility of Kubernetes. App Engine flexible environment is somewhere in between.

App Engine flexible environment treats containers as a means to an end, but for Kubernetes Engine, containers are a fundamental organizing principle.
You'd like for the API to have a single coherent way for it to know which end user is making the call. That's when you use Cloud Endpoints. It implements these capabilities and more using an easy to deploy proxy in front of your software service, and it provides an API console to wrap up those capabilities in an easy-to-manage interface. Cloud Endpoints supports applications running in GCP's compute platforms in your choice of languages and your choice of client technologies.

Apigee Edge is also a platform for developing and managing API proxies. It has a focus on business problems like rate limiting, quotas, and analytics. Many users of Apigee Edge are providing a software service to other companies, and those features come in handy. Because the backend services for Apigee Edge need not be in GCP, engineers often use it when they are "taking apart" a legacy application. Instead of replacing a monolithic application in one risky move, they can use Apigee Edge to peel off its services one by one, standing up microservices to implement each in turn, until the legacy application can finally be retired.

Friday, May 8, 2020

GCP Study Notes 9: Architecting with Google Kubernetes Engine Specialization (coursera notes)

4 courses from Architecting with Google Kubernetes Engine Specialization:
1. Google Cloud Platform Fundamentals: Core Infrastructure

Infrastructure as a Service, IaaS, and Platform as a Service, PaaS offerings. IaaS offerings provide raw compute, storage, and network organized in ways that are familiar from data centers. PaaS offerings, on the other hand, bind application code you write to libraries that give access to the infrastructure your application needs. That way, you can just focus on your application logic.

In the IaaS model, you pay for what you allocate. In the PaaS model, you pay for what you use. Both sure beat the old way where you bought everything in advance based on lots of risky forecasting. What about SaaS? Software as a Service.

Example: use Cloud Launcher to deploy a solution on Google Cloud Platform. The solution I've chosen is a LAMP stack. LAMP stands for Linux (operating system), Apache HTTP Server (web server), MySQL (relational database), and PHP (web application framework). It's an easy environment for developing web applications. I'll use Cloud Launcher to deploy that stack into a Compute Engine instance.

#use Cloud Shell to create a VM:
gcloud compute zones list | grep us-central1
#set the default zone for the VM:
gcloud config set compute/zone us-central1-b
#create the VM:
gcloud compute instances create "my-vm-2" \
--machine-type "n1-standard-1" \
--image-project "debian-cloud" \
--image "debian-9-stretch-v20190213" \
--subnet "default"

#Connect between VM instances, reach my-vm-1 from my-vm-2:
#click SSH on my-vm-2, then run:
ping my-vm-1
ssh my-vm-1
sudo apt-get install nginx-light -y
sudo nano /var/www/html/index.nginx-debian.html

curl http://localhost/

curl http://my-vm-1/

Bigtable is actually the same database that powers many of Google's core services including search, analytics, maps and Gmail.

Cloud SQL provides several replica services like read, failover, and external replicas. This means that if an outage occurs, Cloud SQL can replicate data between multiple zones with automatic failover. Cloud SQL also helps you back up your data with either on-demand or scheduled backups. It can also scale both vertically, by changing the machine type, and horizontally, via read replicas. From a security perspective, Cloud SQL instances include network firewalls, and customer data is encrypted when on Google's internal networks, and when stored in database tables, temporary files, and backups. If Cloud SQL does not fit your requirements because you need horizontal scalability, consider using Cloud Spanner.
Here are more specific differences in terms of capacity and use case type:

Cloud Spanner offers transactional consistency at a global scale, schemas, SQL, and automatic synchronous replication for high availability. It can also provide petabytes of capacity. Consider using Cloud Spanner if you have outgrown any relational database, are sharding your databases for high throughput, need transactional consistency, global data, and strong consistency, or just want to consolidate your databases. Natural use cases include financial applications and inventory applications.

We already discussed one GCP NoSQL database service: Cloud Bigtable. Another highly scalable NoSQL database choice for your applications is Cloud Datastore. One of its main use cases is to store structured data from App Engine apps. You can also build solutions that span App Engine and Compute Engine with Cloud Datastore as the integration point.

Cloud Datastore: Structured objects, with transactions and SQL-like queries
Cloud Spanner: A relational database with SQL queries and horizontal scalability.
Cloud Bigtable: Structured objects, with lookups based on a single key
Cloud Storage: Immutable binary objects

Example: create a web host using Cloud Storage and a VM:

Task 2: Deploy a web server VM instance

  1. In the GCP Console, on the Navigation menu, click Compute Engine > VM instances.
  2. Click Create.
  3. On the Create an Instance page, for Name, type bloghost
  4. For Region and Zone, select the region and zone assigned by Qwiklabs.
  5. For Machine type, accept the default.
  6. For Boot disk, if the Image shown is not Debian GNU/Linux 9 (stretch), click Change and select Debian GNU/Linux 9 (stretch).
  7. Leave the defaults for Identity and API access unmodified.
  8. For Firewall, click Allow HTTP traffic.
  9. Click Management, security, disks, networking, sole tenancy to open that section of the dialog.
  10. Enter the following script as the value for Startup script:
apt-get update
apt-get install apache2 php php-mysql -y
service apache2 restart
  11. Leave the remaining settings as their defaults, and click Create.

Task 3: Create a Cloud Storage bucket using gsutil command

All Cloud Storage bucket names must be globally unique. To ensure that your bucket name is unique, these instructions will guide you to give your bucket the same name as your Cloud Platform project ID, which is also globally unique.
Cloud Storage buckets can be associated with either a region or a multi-region location: US, EU, or ASIA. In this activity, you associate your bucket with the multi-region closest to the region and zone that Qwiklabs or your instructor assigned you to.
  1. On the Google Cloud Platform menu, click Activate Cloud Shell. If a dialog box appears, click Start Cloud Shell.
  2. For convenience, enter your chosen location into an environment variable called LOCATION. Enter one of these commands:
  3. In Cloud Shell, the DEVSHELL_PROJECT_ID environment variable contains your project ID. Enter this command to make a bucket named after your project ID:
  4. Retrieve a banner image from a publicly accessible Cloud Storage location:
gsutil cp gs://cloud-training/gcpfci/my-excellent-blog.png my-excellent-blog.png
  5. Copy the banner image to your newly created Cloud Storage bucket:
gsutil cp my-excellent-blog.png gs://$DEVSHELL_PROJECT_ID/my-excellent-blog.png
  6. Modify the Access Control List of the object you just created so that it is readable by everyone:
gsutil acl ch -u allUsers:R gs://$DEVSHELL_PROJECT_ID/my-excellent-blog.png
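Once the object is readable by everyone, it can be fetched over HTTPS. The sketch below builds the public link in the same `https://storage.googleapis.com/<bucket>/<object>` form that appears in the `<img>` example in Task 6; the helper name `public_object_url` is an assumption, and the bucket name is the sample project ID from these notes.

```python
from urllib.parse import quote


def public_object_url(bucket, object_name):
    """Build the public HTTPS link for a publicly readable Cloud Storage object."""
    # Percent-encode the object name but keep '/' so nested names stay readable.
    return "https://storage.googleapis.com/{}/{}".format(
        bucket, quote(object_name, safe="/"))


print(public_object_url("qwiklabs-gcp-0005e186fa559a09", "my-excellent-blog.png"))
```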

Task 4: Create the Cloud SQL instance

  1. In the GCP Console, on the Navigation menu, click Storage > SQL.
  2. Click Create instance.
  3. For Choose a database engine, select MySQL.
  4. For Instance ID, type blog-db, and for Root password type a password of your choice.

  5. Set the region and zone assigned by Qwiklabs.

  6. Click Create.

  7. Click on the name of the instance, blog-db, to open its details page.

  8. From the SQL instances details page, copy the Public IP address for your SQL instance to a text editor for use later in this lab.
  9. Click the Users tab, and then click Create user account.
  10. For User name, type blogdbuser
  11. For Password, type a password of your choice. Make a note of it.
  12. Click Create to create the user account in the database.

  13. Click the Connections tab, and then click Add network.

  14. For Name, type web front end
  15. For Network, type the external IP address of your bloghost VM instance, followed by /32
The result will look like this:

  16. Click Done to finish defining the authorized network.
  17. Click Save to save the configuration change.
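The /32 suffix means the authorized network contains exactly one address, so only the bloghost VM may connect. A small sketch with Python's standard `ipaddress` module shows the effect; the address 203.0.113.25 and the `is_authorized` helper are hypothetical stand-ins, not values from the lab.

```python
import ipaddress


def is_authorized(client_ip, authorized_networks):
    """Return True if client_ip falls inside any authorized CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in authorized_networks)


networks = ["203.0.113.25/32"]  # hypothetical external IP of the web server VM

# A /32 network holds a single address.
print(ipaddress.ip_network("203.0.113.25/32").num_addresses)
print(is_authorized("203.0.113.25", networks))  # the VM itself
print(is_authorized("203.0.113.26", networks))  # a neighbouring address
```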

Task 5: Configure an application in a Compute Engine instance to use Cloud SQL

  1. On the Navigation menu, click Compute Engine > VM instances.
  2. In the VM instances list, click SSH in the row for your VM instance bloghost.
  3. In your ssh session on bloghost, change your working directory to the document root of the web server:
cd /var/www/html
  4. Use the nano text editor to edit a file called index.php:
sudo nano index.php
  5. Paste the content below into the file:

<html>
<head><title>Welcome to my excellent blog</title></head>
<body>
<h1>Welcome to my excellent blog</h1>
<?php
$dbserver = "CLOUDSQLIP";
$dbuser = "blogdbuser";
$dbpassword = "DBPASSWORD";
// In a production blog, we would not store the MySQL
// password in the document root. Instead, we would store it in a
// configuration file elsewhere on the web server VM instance.

$conn = new mysqli($dbserver, $dbuser, $dbpassword);

if (mysqli_connect_error()) {
        echo ("Database connection failed: " . mysqli_connect_error());
} else {
        echo ("Database connection succeeded.");
}
?>
</body></html>

  6. Press Ctrl+O, and then press Enter to save your edited file.
  7. Press Ctrl+X to exit the nano text editor.
  8. Restart the web server:
sudo service apache2 restart
  9. Open a new web browser tab and paste into the address bar your bloghost VM instance's external IP address followed by /index.php. The URL will look like this:

When you load the page, you will see that its content includes an error message beginning with the words:
Database connection failed: ...

  10. Return to your ssh session on bloghost. Use the nano text editor to edit index.php again.
sudo nano index.php
  11. In the nano text editor, replace CLOUDSQLIP with the Cloud SQL instance Public IP address that you noted above. Leave the quotation marks around the value in place.
  12. In the nano text editor, replace DBPASSWORD with the Cloud SQL database password that you defined above. Leave the quotation marks around the value in place.
  13. Press Ctrl+O, and then press Enter to save your edited file.
  14. Press Ctrl+X to exit the nano text editor.
  15. Restart the web server:
sudo service apache2 restart
  16. Return to the web browser tab in which you opened your bloghost VM instance's external IP address. When you load the page, the following message appears:
Database connection succeeded.

Task 6: Configure an application in a Compute Engine instance to use a Cloud Storage object

  1. In the GCP Console, click Storage > Browser.
  2. Click on the bucket that is named after your GCP project.
  3. In this bucket, there is an object called my-excellent-blog.png. Copy the URL behind the link icon that appears in that object's Public access column, or behind the words "Public link" if shown.

  4. Return to your ssh session on your bloghost VM instance.
  5. Enter this command to set your working directory to the document root of the web server:
cd /var/www/html
  6. Use the nano text editor to edit index.php:
sudo nano index.php
  7. Use the arrow keys to move the cursor to the line that contains the h1 element. Press Enter to open up a new, blank screen line, and then paste the URL you copied earlier into the line.
  8. Paste this HTML markup immediately before the URL:
<img src='
  9. Place a closing single quotation mark and a closing angle bracket at the end of the URL:
'>
The resulting line will look like this:
<img src='https://storage.googleapis.com/qwiklabs-gcp-0005e186fa559a09/my-excellent-blog.png'>
The effect of these steps is to place the line containing <img src='...'> immediately before the line containing <h1>...</h1>

  10. Press Ctrl+O, and then press Enter to save your edited file.
  11. Press Ctrl+X to exit the nano text editor.
  12. Restart the web server:
sudo service apache2 restart

Thursday, April 30, 2020

GCP Study Notes 8: step by step example to develop and deploy model using Bigquery and DataFlow

We will use step by step example to cover how to develop and deploy model using Bigquery and DataFlow.

The more sophisticated your models are, the more struggles you’ll face when it comes to production. Did you ever regret ensembling 5 different models when developing a customer churn classifier? Don’t worry, Apache Beam comes to rescue.

Why do we need to use Apache Beam via Google Cloud Dataflow? Before getting started with Apache Beam, let's see what options you have to operate your model.

1. The first option is creating a single virtual machine (VM) on the cloud for computing tasks. Fair enough, but setting up and managing that VM would be a headache because it requires a lot of manual work (your cloud engineers will not like it!).

2. Cloud Dataproc helps to relieve the management burden, and is another good option to consider, as it provides computing resources that only live for the duration of one run. However, you'll need to spend some time converting your Python code into PySpark or Scala, not to mention that you may not be able to fully replicate what you did in Python with these programming languages.

3. Dataflow is likely to be a good option, since:
You can write Python3 using Apache Beam Python SDK to create a data pipeline that runs on the DataflowRunner backend. What does that mean? It means you can write Python3 to construct your pipeline, and operate your ML models built on Python directly without converting them into Scala or PySpark.

Dataflow is a fully managed service that minimises latency, processing time, and cost through autoscaling worker resources. Apache Beam is an open-source, unified model that allows users to build a program by using one of the open-source Beam SDKs (Python is one of them) to define data processing pipelines. The pipeline is then translated by Beam Pipeline Runners to be executed by distributed processing backends, such as Google Cloud Dataflow.

#1. Creating datasets for Machine Learning using Dataflow
#run the following twice in notebook to avoid oauth2client error
pip install --user apache-beam[gcp]
#Note:  You may ignore the following responses in the cell output above:
#ERROR (in Red text) related to: witwidget-gpu, fairing
#WARNING (in Yellow text) related to: hdfscli, hdfscli-avro, pbr, fastavro, gen_client
#Restart the kernel before proceeding further (On the Notebook menu - Kernel - Restart Kernel).

import apache_beam as beam

# change these to try this notebook out
BUCKET = 'qwiklabs-gcp-04-********-vcm'
PROJECT = 'qwiklabs-gcp-04-********'
REGION = 'us-central1'

import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

# Create SQL query using natality data after the year 2000
query = """
WHERE year > 2000
"""

# Call BigQuery and examine in dataframe
from google.cloud import bigquery
df = bigquery.Client().query(query + " LIMIT 100").to_dataframe()

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. The Apache Beam programming model simplifies the mechanics of large-scale data processing. It provides useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers and sharding datasets.
import datetime, os

def to_csv(rowdict):
  # Pull columns from BQ and create a line
  import hashlib
  import copy
  CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks'.split(',')

  # Create synthetic data where we assume that no ultrasound has been performed
  # and so we don't know sex of the baby. Let's assume that we can tell the difference
  # between single and multiple, but that the errors rates in determining exact number
  # is difficult in the absence of an ultrasound.
  no_ultrasound = copy.deepcopy(rowdict)
  w_ultrasound = copy.deepcopy(rowdict)

  no_ultrasound['is_male'] = 'Unknown'
  if rowdict['plurality'] > 1:
    no_ultrasound['plurality'] = 'Multiple(2+)'
  else:
    no_ultrasound['plurality'] = 'Single(1)'

  # Change the plurality column to strings
  w_ultrasound['plurality'] = ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)'][rowdict['plurality'] - 1]

  # Write out two rows for each input row, one with ultrasound and one without
  for result in [no_ultrasound, w_ultrasound]:
    data = ','.join([str(result[k]) if k in result else 'None' for k in CSV_COLUMNS])
    key = hashlib.sha224(data.encode('utf-8')).hexdigest()  # hash the columns to form a key
    yield str('{},{}'.format(data, key))

def preprocess(in_test_mode):
  import shutil, os, subprocess
  job_name = 'preprocess-babyweight-features' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')

  if in_test_mode:
      print('Launching local job ... hang on')
      OUTPUT_DIR = './preproc'
      shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
      os.makedirs(OUTPUT_DIR)
  else:
      print('Launching Dataflow job {} ... hang on'.format(job_name))
      OUTPUT_DIR = 'gs://{0}/babyweight/preproc/'.format(BUCKET)
      try:
        # Clear any previous output; ignore the error if nothing exists yet
        subprocess.check_call('gsutil -m rm -r {}'.format(OUTPUT_DIR).split())
      except subprocess.CalledProcessError:
        pass

  options = {
      'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
      'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
      'job_name': job_name,
      'project': PROJECT,
      'region': REGION,
      'teardown_policy': 'TEARDOWN_ALWAYS',
      'no_save_main_session': True,
      'num_workers': 4,
      'max_num_workers': 5
  }
  opts = beam.pipeline.PipelineOptions(flags = [], **options)
  if in_test_mode:
      RUNNER = 'DirectRunner'    # run locally
  else:
      RUNNER = 'DataflowRunner'  # run on Cloud Dataflow
  p = beam.Pipeline(RUNNER, options = opts)
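  # SELECT list and table are inferred from the columns used here (CSV_COLUMNS and hashmonth)
  query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
"""
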
  if in_test_mode:
    query = query + ' LIMIT 100' 

  for step in ['train', 'eval']:
    # Roughly 3 of every 4 hashmonth buckets go to training, the rest to evaluation
    if step == 'train':
      selquery = 'SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 4)) < 3'.format(query)
    else:
      selquery = 'SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 4)) = 3'.format(query)

    (p
     | '{}_read'.format(step) >> beam.io.Read(beam.io.BigQuerySource(query = selquery, use_standard_sql = True))
     | '{}_csv'.format(step) >> beam.FlatMap(to_csv)
     | '{}_out'.format(step) >> beam.io.Write(beam.io.WriteToText(os.path.join(OUTPUT_DIR, '{}.csv'.format(step))))
    )

  job = p.run()
  if in_test_mode:
    # The local run is small, so block until it finishes
    job.wait_until_finish()
    print('Done!')

preprocess(in_test_mode = False)

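The `ABS(MOD(hashmonth, 4)) < 3` filter in the pipeline sends roughly three of every four year-month buckets to training and the rest to evaluation, and the assignment is repeatable because it depends only on the hash. Here is a minimal plain-Python sketch of the same idea. It uses `hashlib.sha256` as a stand-in for BigQuery's `FARM_FINGERPRINT`, so the bucket assignments will not match BigQuery's, and `assign_split` is a hypothetical helper name, not part of the lab.

```python
import hashlib

def assign_split(year, month, buckets=4, train_buckets=3):
    """Return 'train' or 'eval' for a (year, month) pair, deterministically."""
    key = '{}{}'.format(year, month).encode('utf-8')
    # Stand-in for FARM_FINGERPRINT: first 8 bytes of a SHA-256 digest as an integer
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], 'big')
    return 'train' if h % buckets < train_buckets else 'eval'

# Every row from the same year-month lands in the same split, on every run
months = [(y, m) for y in range(2001, 2009) for m in range(1, 13)]
splits = [assign_split(y, m) for y, m in months]
print(splits.count('train'), 'train months,', splits.count('eval'), 'eval months')
```

Splitting on a hash of the month, rather than on a random number per row, keeps all births from one month together, so near-duplicate rows do not end up on both sides of the split.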
Example code to back up a BigQuery table via a GCS bucket:
If an ETL job goes bad and you want to revert to yesterday's data, you can simply use time travel:
CREATE OR REPLACE TABLE dataset.table_restored
AS
SELECT *
FROM dataset.table
FOR SYSTEM_TIME AS OF TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
# However, time travel is restricted to 7 days.
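Because time travel only reaches back 7 days, it can help to check the offset before sending a restore statement. The sketch below builds the statement as a string and rejects offsets outside that window. `restore_query` and its parameters are hypothetical names for illustration; the `FOR SYSTEM_TIME AS OF` clause is BigQuery's time-travel syntax, and nothing here calls BigQuery.

```python
def restore_query(source, target, hours_ago, max_hours=7 * 24):
    """Build a time-travel restore statement for 'dataset.table' style names."""
    # Reject offsets beyond the 7-day time-travel window noted above
    if not 0 < hours_ago <= max_hours:
        raise ValueError('hours_ago must be greater than 0 and at most {}'.format(max_hours))
    return (
        'CREATE OR REPLACE TABLE {target} AS\n'
        'SELECT * FROM {source}\n'
        'FOR SYSTEM_TIME AS OF '
        'TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours} HOUR)'
    ).format(target=target, source=source, hours=hours_ago)

# Restore the table as it was 24 hours ago
print(restore_query('dataset.table', 'dataset.table_restored', 24))
```

The resulting string could be passed to `bigquery.Client().query(...)`, the same call used earlier in these notes.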

# back up via files: save the schema, the table definition, and the data
bq show --schema dataset.table  # save the output as schema.json
bq --format=json show dataset.table  # save the output as tbldef.json
bq extract --destination_format=AVRO \
           dataset.table gs://.../data_*.avro # AVRO files
# restore from the exported files
bq load --source_format=AVRO \
    --time_partitioning_expiration ... \
    --time_partitioning_field ... \
    --time_partitioning_type ... \
    --clustering_fields ... \
    --schema ... \
    todataset.table_name \
    gs://.../data_*.avro
# backup: If you do not need the backup to be in the form of files,
# a much simpler way to back up your BigQuery table is to use bq cp:
bq mk dataset_${date}
bq cp dataset.table dataset_${date}.table
# restore
bq cp dataset_20200301.table dataset_restore.table    
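The `${date}` suffix in the backup commands can be generated rather than typed by hand. This is a small sketch that builds the two `bq` commands for a given day; `backup_commands` is a hypothetical helper, and the `YYYYMMDD` format is an assumption taken from the `dataset_20200301` name in the restore example.

```python
import datetime
import shlex

def backup_commands(dataset, table, day=None):
    """Return the bq mk / bq cp argument lists for a dated backup dataset."""
    day = day or datetime.date.today()
    backup_dataset = '{}_{}'.format(dataset, day.strftime('%Y%m%d'))
    return [
        ['bq', 'mk', backup_dataset],
        ['bq', 'cp', '{}.{}'.format(dataset, table),
         '{}.{}'.format(backup_dataset, table)],
    ]

# Print the commands for 1 March 2020 instead of running them
for cmd in backup_commands('dataset', 'table', datetime.date(2020, 3, 1)):
    print(' '.join(shlex.quote(part) for part in cmd))
# bq mk dataset_20200301
# bq cp dataset.table dataset_20200301.table
```

Each list could be handed to `subprocess.check_call` to run it, in the same way the preprocessing code calls `gsutil`.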
