Tuesday, June 2, 2020

GCP Study notes 13: Architecting with Google Kubernetes Engine: Foundations (Coursera notes)

Architecting with Google Kubernetes Engine Specialization: 4 Courses in this Specialization.

1. Google Cloud Platform Fundamentals: Core Infrastructure
2. Architecting with Google Kubernetes Engine: Foundations
3. Architecting with Google Kubernetes Engine: Workloads
4. Architecting with Google Kubernetes Engine: Production

Architecting with Google Compute Engine Specialization: 5 Courses in this Specialization.

1. Google Cloud Platform Fundamentals: Core Infrastructure
2. Essential Google Cloud Infrastructure: Foundation
3. Essential Google Cloud Infrastructure: Core Services
4. Elastic Google Cloud Infrastructure: Scaling and Automation
5. Reliable Google Cloud Infrastructure: Design and Process

What is Kubernetes?

Kubernetes is an orchestration framework for software containers. Containers are a way to package and run code that's more efficient than virtual machines. Kubernetes provides the tools you need to run containerized applications in production and at scale.

What is Google Kubernetes Engine? Google Kubernetes Engine (GKE) is a managed service for Kubernetes.

Kubernetes Engine: A managed environment for deploying containerized applications
Compute Engine: A managed environment for deploying virtual machines
App Engine: A managed serverless platform for deploying applications
Cloud Functions: A managed serverless platform for deploying event-driven functions
A GCP Cloud Function is roughly equivalent to an AWS Lambda function: for example, dropping a file into a bucket can trigger some process to run.

An IAM service account is a special type of Google account that belongs to an application or a virtual machine, instead of to an individual end user.

Create an IAM service account: on the Service account permissions page, you would usually specify the role as Project > Editor.
Click + Create Key.
Select JSON as the key type.
A JSON key file is then downloaded; we can upload that JSON file to a VM later.

Task 2. Explore Cloud Shell
Cloud Shell provides the following features and capabilities:
5 GB of persistent disk storage ($HOME dir)
Preinstalled Cloud SDK and other tools
gcloud: for working with Compute Engine, Google Kubernetes Engine (GKE) and many GCP services
gsutil: for working with Cloud Storage
kubectl: for working with GKE and Kubernetes: most important command for GKE!
bq: for working with BigQuery
kubectl is used to create, update, and delete Kubernetes resources like pods, deployments, and load balancers. kubectl can't be used to directly provision the nodes or clusters your pods run on, because Kubernetes was designed to be platform-agnostic (platform non-specific). Kubernetes doesn't know or care where it is running, so there is no built-in way for it to communicate with your chosen cloud provider to rent nodes on your behalf. Because we are using Google Kubernetes Engine for this tutorial, we use the gcloud command for those tasks.
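A sketch of this division of labor (the cluster name and zone here are hypothetical placeholders): gcloud provisions the cluster, then kubectl takes over day-to-day resource management.

```shell
# Provision the cluster itself with gcloud -- Kubernetes has no built-in way
# to do this. "my-cluster" and the zone are hypothetical placeholder values.
gcloud container clusters create my-cluster --zone us-central1-a

# Fetch credentials so kubectl can talk to the new cluster.
gcloud container clusters get-credentials my-cluster --zone us-central1-a

# From here on, Kubernetes resources are managed with kubectl.
kubectl create deployment nginx --image=nginx
kubectl get pods
```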

A node is the smallest unit of computing hardware in Kubernetes. One node == One single machine in your cluster!

Programs running on Kubernetes are packaged as Linux containers. Containers are a widely accepted standard, so there are already many pre-built images that can be deployed on Kubernetes.

Use Cloud Shell to set up the environment variables for this task:

make a storage bucket:
gsutil mb gs://$MY_BUCKET_NAME_2

list of all zones in a given region:
gcloud compute zones list | grep $MY_REGION
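The pipeline above simply filters gcloud's line-oriented output with grep. A self-contained sketch of the same filtering step, with made-up zone names standing in for the real gcloud output:

```shell
# Simulate the line-oriented output of `gcloud compute zones list`
# (the zone names here are stand-ins) and keep only one region's zones.
MY_REGION=us-central1
printf 'us-central1-a\nus-central1-b\neurope-west1-b\n' | grep "$MY_REGION"
```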

##Set this zone to be your default zone
gcloud config set compute/zone $MY_ZONE

#create a vm using command:
gcloud compute instances create $MY_VMNAME \
--machine-type "n1-standard-1" \
--image-project "debian-cloud" \
--image-family "debian-9" \
--subnet "default"

#list of all Vm instances:
gcloud compute instances list

When you browse the list of VM instances, the external IP address of the first VM you created is shown as a link. This is because you configured this VM's firewall to allow HTTP traffic.

Use the gcloud command line to create a second service account: from the cloud shell:
gcloud iam service-accounts create test-service-account2 --display-name "test-service-account2"

#To grant the second service account the Project viewer role:
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT --member serviceAccount:test-service-account2@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role roles/viewer

GOOGLE_CLOUD_PROJECT is an environment variable that is automatically populated in Cloud Shell and is set to the project ID of the current context.
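To make the --member value in the binding above concrete, here is how that service-account email is assembled from the project ID (using a made-up project ID in place of the one Cloud Shell would populate):

```shell
# Hypothetical project ID; in Cloud Shell this variable is set automatically.
GOOGLE_CLOUD_PROJECT=my-sample-project

# Build the service-account member string used in the IAM binding.
MEMBER="serviceAccount:test-service-account2@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com"
echo "$MEMBER"
```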

Task 3. Work with Cloud Storage in Cloud Shell
#Copy a picture of a cat from a Google-provided Cloud Storage bucket
gsutil cp gs://cloud-training/ak8s/cat.jpg cat.jpg
gsutil cp cat.jpg gs://$MY_BUCKET_NAME_1
gsutil cp gs://$MY_BUCKET_NAME_1/cat.jpg gs://$MY_BUCKET_NAME_2/cat.jpg

#To get the default access list that's been assigned to cat.jpg
gsutil acl get gs://$MY_BUCKET_NAME_1/cat.jpg > acl.txt
cat acl.txt

#Set the access control list for a Cloud Storage object
#To change the object to have private access, execute the following command:
gsutil acl set private gs://$MY_BUCKET_NAME_1/cat.jpg

#To verify the new ACL that's been assigned to cat.jpg
gsutil acl get gs://$MY_BUCKET_NAME_1/cat.jpg > acl-2.txt
cat acl-2.txt

#Authenticate as a service account in Cloud Shell
#to view the current configuration
gcloud config list

#change the authenticated user to the first service account
gcloud auth activate-service-account --key-file credentials.json
gcloud config list
#The output should show that the active account is now set to the test-service-account service account.

#To verify the list of authorized accounts in Cloud Shell,
gcloud auth list
#The output lists two accounts: the student account and the service account (active).

#To verify that the current account (test-service-account) cannot access the cat.jpg file in the first bucket that you created:
gsutil cp gs://$MY_BUCKET_NAME_1/cat.jpg ./cat-copy.jpg
#you should get some error message: AccessDeniedException: 403 HttpError accessing

#but you can copy the one from the 2nd bucket:
gsutil cp gs://$MY_BUCKET_NAME_2/cat.jpg ./cat-copy.jpg

#To switch to the lab account, execute the following command.
gcloud config set account student-02-409a8***d4@qwiklabs.net
#to copy the same cat picture:
gsutil cp gs://$MY_BUCKET_NAME_1/cat.jpg ./copy2-of-cat.jpg

#Make the first Cloud Storage bucket readable by everyone, including unauthenticated users.
gsutil iam ch allUsers:objectViewer gs://$MY_BUCKET_NAME_1
#This is an appropriate setting for hosting public website content in Cloud Storage.
#Even if you switch back to the service account from earlier, you can now copy from bucket 1, because it is publicly readable.

#Open the Cloud Shell code editor
#clone a git repository, orchestrate-with-kubernetes folder appears in the left pane of the Cloud Shell code editor window:
git clone https://github.com/googlecodelabs/orchestrate-with-kubernetes.git

#to create a test directory in cloud shell:
mkdir test

#Add the following text as the last line of the cleanup.sh file:
echo Finished cleanup!
#run the code in cloud shell to check the output:
cd orchestrate-with-kubernetes
cat cleanup.sh

#Create a new file named index.html with the following content:
#(replace the URL with the actual link to the cat image)
<img src="REPLACE_WITH_CAT_URL" />

##ssh to the VM, run the code in new window:
#to install the package for VM to host.
sudo apt-get update
sudo apt-get install nginx

#run the code in cloud shell to copy the html file:
gcloud compute scp index.html first-vm:index.nginx-debian.html --zone=us-central1-c

#If you are prompted whether to add a host key to your list of known hosts, answer y.
#If you are prompted to enter a passphrase, press the Enter key to respond with an empty passphrase. Press the Enter key again when prompted to confirm the empty passphrase.

#In the SSH login window for your VM, copy the HTML file from your home directory to the document root of the nginx Web server:
sudo cp index.nginx-debian.html /var/www/html

Imperative object configuration
In imperative object configuration, the kubectl command specifies the operation (create, replace, etc.), optional flags, and at least one file name. The file specified must contain a full definition of the object in YAML or JSON format.

Declarative object configuration
When using declarative object configuration, a user operates on object configuration files stored locally; however, the user does not define the operations to be taken on the files. Create, update, and delete operations are automatically detected per-object by kubectl. This enables working on directories, where different operations might be needed for different objects.
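A minimal sketch of the two styles side by side (the manifest file and directory names are hypothetical):

```shell
# Imperative object configuration: you name the operation explicitly.
kubectl create -f nginx-deployment.yaml    # create the object
kubectl replace -f nginx-deployment.yaml   # replace it with the file's definition
kubectl delete -f nginx-deployment.yaml    # delete it

# Declarative object configuration: kubectl infers the operation.
# Running apply repeatedly creates the object if absent, or patches it to match.
kubectl apply -f configs/                  # works on whole directories too
```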

Stateless VS Stateful!
A stateless application is one that neither reads nor stores information about its state from one time that it is run to the next. "State" in this case can refer to any changeable condition, including the results of internal operations, interactions with other applications or services, user-set preferences, environment variables, the contents of memory or temporary storage, or files opened, read from, or written to.

A key point to keep in mind is that statefulness requires persistent storage. An application can only be stateful if it has somewhere to store information about its state, and if that information will be available for it to read later.

For an application running on a typical desktop system, that generally isn't a problem. It can usually store state data in a temp file, a database, or the system registry. Network- and Internet-based applications may be able to store state data on individual users' systems (for example, in the form of cookies), or on the server. As long as there is some kind of persistent storage, it is possible for a stateful application to save state data.

But what about containers? The ideal container, after all, pops up out of nowhere, does its job, and disappears. If it performs any operations involving data coming from/going to somewhere else, it is given the data by another process or service, and in turn hands the result off to some other process. Where could it store any information about its state? As originally conceived, containers couldn't save state information.

How can a container be stateful, if it doesn't have persistent storage? There are now several well-established vendors that do provide persistent storage for containers, including databases for storing container state information.

Companies and projects such as Docker, Kubernetes, Flocker, and Mesosphere provide ways of managing both stateless and stateful containers using persistently stored data. Most of the key vendors in the container industry appear to see statefulness as a major part of the container landscape, and one that is here to stay, rather than a vestige of pre-container development style. For most developers, the question is not whether to use stateful containers, but when they should be used.

As we said earlier, the advantage of statelessness is that it is simple. Statefulness, on the other hand, does require at least some overhead: persistent storage, and more likely, a state management system. This means more software to install, manage, and configure, and more programming time to connect to it via API.

When to Use Multiple Namespaces
Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces.

Namespaces are intended for use in environments with many users spread across multiple teams, or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all.

Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces cannot be nested inside one another and each Kubernetes resource can only be in one namespace.

Namespaces are a way to divide cluster resources between multiple users (via resource quota).
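A sketch of such a quota (the namespace name and limits here are illustrative), capping what a single namespace can consume:

```yaml
# Illustrative ResourceQuota: limits the "team-a" namespace's total requests.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    pods: "10"
```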

You can list the current namespaces in a cluster using:
kubectl get namespace
NAME              STATUS   AGE
default           Active   1d
kube-node-lease   Active   1d
kube-public       Active   1d
kube-system       Active   1d

Kubernetes manifests are used to create, modify, and delete Kubernetes resources such as pods, deployments, services, or ingresses. It is very common to define manifests in the form of .yaml files and send them to the Kubernetes API server via commands such as kubectl apply -f my-file.yaml or kubectl delete -f my-file.yaml.

Kubernetes doesn’t run containers directly; instead it wraps one or more containers into a higher-level structure called a pod. Any containers in the same pod will share the same resources and local network. Containers can easily communicate with other containers in the same pod as though they were on the same machine while maintaining a degree of isolation from others.
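A sketch of a two-container pod (the names are illustrative): because both containers share the pod's network namespace, the sidecar can reach the web container over localhost.

```yaml
# Illustrative two-container pod: both containers share one IP and can
# reach each other over localhost.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
  - name: web
    image: nginx:1.7.9
    ports:
    - containerPort: 80
  - name: sidecar
    image: busybox
    command: ["sh", "-c", "while true; do wget -qO- http://localhost:80 > /dev/null; sleep 10; done"]
```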

Pods can hold multiple containers, but you should limit a pod to a single container where possible, adding more only when the containers are tightly coupled.

A deployment’s primary purpose is to declare how many replicas of a pod should be running at a time. When a deployment is added to the cluster, it will automatically spin up the requested number of pods, and then monitor them. If a pod dies, the deployment will automatically re-create it.
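Assuming a deployment named nginx-deployment with pods labeled app=nginx has been applied, this scaling and self-healing behavior can be observed with commands like these (a sketch, not a definitive recipe):

```shell
# Scale the deployment up; it will spin up the extra pod automatically.
kubectl scale deployment nginx-deployment --replicas=3

# Watch the deployment converge on the desired replica count.
kubectl get deployment nginx-deployment

# Deleting a pod by hand demonstrates self-healing: the deployment re-creates it.
kubectl get pods -l app=nginx
kubectl delete pod POD_NAME   # placeholder: substitute one of the listed pod names
```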

YAML (originally "Yet Another Markup Language", now a recursive acronym for "YAML Ain't Markup Language") is a superset of JSON, which means that it has all the functionality of JSON, but it also extends this functionality to some degree.

Here is an example deployment.yaml file:

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 2 # tells deployment to run 2 pods matching the template
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

Ephemeral clusters vs. long-running elastic clusters
For something like Spark Streaming or Flink (say, for stream processing), where the job is "infinite", it makes sense to have a long-running EMR (Elastic MapReduce) cluster. Another use case is an OLTP workload like HBase.

For batch analytics and ad-hoc queries, it's usually a good idea to use ephemeral clusters. There are some caveats, though: you still need a long-running metastore (either a Hive metastore or Glue) to hold your metadata, and for more complex workflows and job interdependencies you can use something like Apache Airflow to orchestrate your ephemeral workloads.
