Saturday, December 7, 2019

GCP Study Notes 2: start running Jupyter in 5 minutes on GCP, with a BigQuery example



The easiest way to run a Jupyter notebook in GCP
There are several ways to run a Jupyter notebook in GCP: you can start by creating a Dataproc cluster or a VM and then install the Anaconda package there. That involves quite a few configuration settings and may not be easy for a traditional data scientist with more of a statistics background. Here is the easiest way I have found to run a Jupyter session in GCP with the fewest clicks.

Go to the left-hand menu of the GCP console, find "AI Platform", then click "Notebooks"; you can create a notebook instance with a few clicks. After the instance is created, click "Open JupyterLab" and you are in the familiar Jupyter environment, where you can create a .txt file, rename it to .py, and run it much like you would in Spyder.

Another powerful part is the built-in tutorials: once you are in the Jupyter environment, you will see a "tutorials" folder with many examples you can try.

Error: "cannot import name automl_v1beta1", "cannot import name automlwrapper","ImportError: No module named 'automlwrapper'"

If you are in the Jupyter environment and try to import automlwrapper or automl_v1beta1, you might see one of the error messages above. Here is a quick way to fix it:
#===========================================================
#run the following in the terminal
pip install --user --upgrade google-cloud-automl
#then restart the kernel in Jupyter and run the imports again:
from google.cloud import storage
from google.cloud import automl_v1beta1 as automl
from automlwrapper import AutoMLWrapper
#note: automlwrapper is typically a helper file (automlwrapper.py) shipped with the tutorial
#notebooks rather than a pip package; if that import still fails, make sure the file is in
#the notebook's working directory.
#===========================================================
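To confirm the fix worked, here is a minimal sanity check, a sketch that assumes the notebook's default credentials can access the current project (the AutoML client is only constructed, not used):
#===========================================================
# Minimal sanity check after restarting the kernel (assumes the notebook's
# default credentials can read Cloud Storage in the current project).
from google.cloud import storage
from google.cloud import automl_v1beta1 as automl

storage_client = storage.Client()                         # uses Application Default Credentials
print([b.name for b in storage_client.list_buckets()])    # list the buckets you can see

automl_client = automl.AutoMlClient()                     # verifies the AutoML client imports cleanly
print(type(automl_client).__name__)
#===========================================================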

Here is some example code for creating a Dataproc cluster with Python and a set of packages pre-installed.
#===========================================================
#!/bin/bash
set -euo pipefail

DIR="${BASH_SOURCE%/*}"
[[ ! -d "${DIR}" ]] && DIR="${PWD}"
readonly DIR

echo "Creating dataproc cluster..."

function usage {
  cat << EOF
usage: $0 [-h] [-c=cluster-name]
  -h                 display help
  -c=cluster-name    specify unique dataproc cluster name to launch
EOF
  exit 1
}

for i in "$@"
do
  case $i in
    -c=*)
      DATAPROC_CLUSTER_NAME="${i#*=}"
      shift # past argument=value
      ;;
    -h)
      usage
      ;;
    *)
      ;;
  esac
done

[[ -z "${DATAPROC_CLUSTER_NAME:-}" ]] && usage
echo "Using the following cluster name: ${DATAPROC_CLUSTER_NAME}"

PROJECT_ID="****-edg-***-poc-****"
PROJECT_NAME="****-edg-***-poc"
REGION="global"
SERVICE_ACCOUNT="dataproc-cl-admin-sa@$PROJECT_ID.iam.gserviceaccount.com"

# NOTE: change the following parameters to suit your needs
# master-boot-disk-size
# worker-boot-disk-size
# num-workers
# num-preemptible-workers (if needed) ADD BACKSLASH TO PREVIOUS UNCOMMENTED LINE IF YOU USE THIS
# max-idle (if needed) ADD BACKSLASH TO PREVIOUS UNCOMMENTED LINE IF YOU USE THIS
gcloud beta dataproc clusters create ${DATAPROC_CLUSTER_NAME} \
    --image-version 1.4 \
    --zone=us-west1-a \
    --bucket=edg-dsa-users \
    --enable-component-gateway \
    --metadata 'MINICONDA_VARIANT=3' \
    --metadata 'MINICONDA_VERSION=latest' \
    --metadata 'CONDA_PACKAGES=jupyterlab' \
    --metadata 'PIP_PACKAGES=pandas patsy datetime pyhive pandas_gbq argparse sklearn gcsfs pandasql keras tensorflow seaborn pandas_profiling google-cloud-storage paramiko google-cloud-automl' \
    --metadata 'PYTHON_VERSION=3.7' \
    --subnet "projects/****-network-sbx-****/regions/us-east1/subnetworks/****-data-svcs-1-us-w1-sbx-subnet" \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --tags allow-ssh,egress-nat-gce \
    --project $PROJECT_ID \
    --service-account $SERVICE_ACCOUNT \
    --num-workers=2 \
    --num-preemptible-workers=0 \
    --master-machine-type=n1-highmem-96 \
    --worker-machine-type=n1-standard-1 \
    --master-boot-disk-size=1000GB \
    --worker-boot-disk-size=15GB \
    --properties "yarn:yarn.scheduler.maximum-allocation-mb=241664,yarn:yarn.nodemanager.resource.memory-mb=241664" \
    --initialization-actions \
    gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh,gs://edg-dsa-scripts/dataproc/jupyter.sh,gs://dataproc-initialization-actions/hue/hue.sh,gs://dataproc-initialization-actions/python/pip-install.sh
#===========================================================


What's the difference between Kubernetes and Docker? In short, Docker packages and runs individual containers, while Kubernetes orchestrates many containers across a cluster of machines (see the Kubernetes description below).

There are two kinds of usage discounts: committed use discounts and sustained use discounts.

vCPU = the unit of measure for CPU.
RAM is measured in gigabytes (GB).

There are three different ways to work with GCP: the command line (gcloud), the REST API, and the Google Cloud Console.

HDD: hard disk drive.
SSD: solid-state drive. SSD is more expensive than HDD and much faster; local SSDs are physically attached to the server that hosts the VM.

2 kinds of VM: standard and preemptible. Preemptible virtual machines (PVMs) are much cheaper (up to 80% cheaper than regular instances) and technically the same as standard VMs, but they run on excess Compute Engine capacity: they are terminated whenever GCP needs that capacity back, and in any case after 24 hours, whichever comes first. This is why they might be stopped at any time.
#===============================================================
# enable the preemptible option when creating a VM
gcloud compute instances create my-vm --zone us-central1-b --preemptible
#===============================================================

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes was built by Google based on their experience running containers in production using an internal cluster management system called Borg.

In GCP, Compute Engine provides virtual machines; it is the equivalent of AWS EC2.

Here is an example of using Papermill to run Jupyter notebooks automatically, write the output to GCS, and then shut the instance down to save cost.
The following commands start execution of a Jupyter notebook stored in a Cloud Storage bucket:

#===============================================================
# Compute Engine Instance parameters
export IMAGE_FAMILY="tf-latest-cu100" 
export ZONE="us-central1-b"
export INSTANCE_NAME="notebook-executor"
export INSTANCE_TYPE="n1-standard-8"
# Notebook parameters
export INPUT_NOTEBOOK_PATH="gs://my-bucket/input.ipynb"
export OUTPUT_NOTEBOOK_PATH="gs://my-bucket/output.ipynb"
export PARAMETERS_FILE="params.yaml" # Optional
export PARAMETERS="-p batch_size 128 -p epochs 40"  # Optional
export STARTUP_SCRIPT="papermill ${INPUT_NOTEBOOK_PATH} ${OUTPUT_NOTEBOOK_PATH} -y ${PARAMETERS_FILE} ${PARAMETERS}"
gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator='type=nvidia-tesla-t4,count=2' \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=100GB \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --metadata="install-nvidia-driver=True,startup-script=${STARTUP_SCRIPT}"
# NOTE: the startup script runs asynchronously; wait for the output notebook to appear
# in Cloud Storage before deleting the instance.
gcloud --quiet compute instances delete $INSTANCE_NAME --zone $ZONE
#===============================================================


The above commands do the following:

Create a Compute Engine instance using TensorFlow Deep Learning VM and 2 NVIDIA Tesla T4 GPUs
Install the latest NVIDIA GPU drivers
Execute the notebook using Papermill

Upload the notebook result (with all the cells pre-computed) to the Cloud Storage bucket, in this case “gs://my-bucket/”

Terminate the Compute Engine instance
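
If you prefer to drive Papermill from Python instead of the shell startup script, a minimal sketch looks like this (the paths and parameter values are placeholders, not part of the original setup):
#===============================================================
# Minimal Papermill sketch (paths and parameter values are placeholders).
import papermill as pm

pm.execute_notebook(
    'gs://my-bucket/input.ipynb',    # input notebook (GCS paths work when gcsfs is installed)
    'gs://my-bucket/output.ipynb',   # executed copy with all cell outputs saved
    parameters={'batch_size': 128, 'epochs': 40},
)
#===============================================================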
After clicking "Activate Cloud Shell", you can view the definitions of the basic roles in the terminal by running:

gcloud iam roles describe roles/viewer
gcloud iam roles describe roles/editor
gcloud iam roles describe roles/owner

Google Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way.
Cloud Dataproc easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics, and machine learning.

Apache Spark is an analytics engine for large-scale data processing. Logistic regression is available as a module in Apache Spark's machine learning library, MLlib.
Spark MLlib, also called Spark ML, includes implementations of most standard machine learning algorithms, such as k-means clustering, random forests,
alternating least squares, decision trees, support vector machines, etc. Spark can run on a Hadoop cluster, such as Google Cloud Dataproc, to process very large datasets in parallel.
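
As an illustration, here is a minimal PySpark sketch of MLlib's logistic regression; it is my own toy example, and the column names and data are assumptions, not from the post:
#===========================================================
# Minimal PySpark MLlib logistic regression sketch (toy data; column names are assumptions).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data with two features and a binary label.
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1), (0.1, 0.9, 0)],
    ["x1", "x2", "label"],
)

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembler.transform(df))
print(model.coefficients)
#===========================================================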

Relational databases are also called Relational Database Management Systems (RDBMS) or SQL databases. Historically,
the most popular of these have been Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2.
Relational databases are designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing.

Non-relational databases are also called NoSQL databases. NoSQL has become an industry-standard term,
but the name is beginning to lose popularity since it doesn't fully cover the complexity and range of non-relational data stores that are available.
Some of the best-known NoSQL or non-relational databases that Serra discussed are MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j.

Hadoop is a distributed data processing framework whose core components are the Hadoop Distributed File System (HDFS), YARN, and MapReduce.

Cloud Spanner is the first scalable, enterprise-grade, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.

Difference between horizontally scaling and vertically scaling:

In simple layman's terms, horizontal scaling is more complex than vertical scaling (adding more CPU and memory to an existing machine).

Horizontal scaling means that you scale by adding more machines into your pool of resources
whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.

Horizontal scaling, also referred to as "scale-out", is basically the addition of more machines, i.e. setting up a cluster or a distributed environment for your software system.
This usually requires a load balancer, a middleware component in the standard 3-tier client-server architectural model.

Vertical scaling, also referred to as the "scale-up" approach, is an attempt to increase the capacity of a single machine:
by adding more processing power, more storage, more memory, etc.

Horizontal scaling comes with overhead in the form of cluster setup, management, and maintenance costs and complexities.
The design gets increasingly complex and the programming model changes.

Relational Databases:

Pros: Relational databases work with structured data.
They support ACID transactional consistency and support "joins."
They come with built-in data integrity and a large ecosystem.
Relationships in this system have constraints.
There is limitless indexing, and SQL support is strong.

Cons:
Relational databases do not scale out horizontally very well (concurrency and data size), only vertically (unless you use sharding).
Data is normalized, meaning lots of joins, which affects speed.
They have problems working with semi-structured data.

Non-relational/NoSQL
Pros:
They scale out horizontally and work with unstructured and semi-structured data. Some support ACID transactional consistency.
Schema-free or Schema-on-read options.
High availability.
While many NoSQL databases are open source and therefore "free", there are often considerable training, setup, and development costs. There are now also numerous commercial products available.

Cons:
Weaker or eventual consistency (BASE) instead of ACID.
Limited support for joins.
Data is denormalized, which can require mass updates (e.g., a product name change).
There is no built-in data integrity (it must be enforced in code).
Limited indexing.

What is SSH?
Secure Shell, sometimes referred to as Secure Socket Shell, is a protocol which allows you to connect securely to a remote computer or a server by using a text-based interface.

When a secure SSH connection is established, a shell session will be started, and you will be able to manipulate the server by typing commands within the client on your local computer.
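
As an illustration, here is a minimal Python sketch of opening an SSH session with paramiko (which also appears in the pip package list of the Dataproc script above); the host, user name, and key path are placeholders:
#===========================================================
# Minimal paramiko SSH sketch (host, user name, and key path are placeholders).
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # auto-accept unknown host keys (demo only)
client.connect('my-vm-external-ip', username='my-user', key_filename='/path/to/private_key')

stdin, stdout, stderr = client.exec_command('hostname')  # run a command on the remote machine
print(stdout.read().decode())
client.close()
#===========================================================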


Question:
ImportError: No module named google.cloud


Answer:
1. Run: pip install --upgrade google-cloud-storage
2. Restart the kernel.

Note: if you only run pip install google-cloud, it might not work.

Check GCP credentials:
import os
print('Credentials from environ: {}'.format(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')))
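
If that environment variable is not set, you can point a client at a service-account key file explicitly. A minimal sketch, assuming you have downloaded a key file (the path is a placeholder):
#===========================================================
# Minimal sketch: load a service-account key explicitly (the path is a placeholder).
from google.oauth2 import service_account
from google.cloud import storage

credentials = service_account.Credentials.from_service_account_file('/path/to/key.json')
client = storage.Client(credentials=credentials, project=credentials.project_id)
print([b.name for b in client.list_buckets()])  # quick check that the credentials work
#===========================================================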



Question:
What's the product matching between AWS and GCP?

Answer:

Service Category | Service | AWS | Google Cloud Platform
Compute | IaaS | Amazon Elastic Compute Cloud (EC2) | Compute Engine
Compute | PaaS | AWS Elastic Beanstalk | App Engine
Compute | Containers | Amazon Elastic Container Service | Google Kubernetes Engine
Compute | Serverless Functions | AWS Lambda | Cloud Functions
Compute | Managed Batch Computing | AWS Batch | N/A
Network | Virtual Networks | Amazon Virtual Private Cloud | Virtual Private Cloud
Network | Load Balancer | Elastic Load Balancer | Cloud Load Balancing
Network | Dedicated Interconnect | Direct Connect | Cloud Interconnect
Network | Domains and DNS | Amazon Route 53 | Google Domains, Cloud DNS
Network | CDN | Amazon CloudFront | Cloud CDN
Storage | Object Storage | Amazon Simple Storage Service (S3) | Cloud Storage
Storage | Block Storage | Amazon Elastic Block Store | Persistent Disk
Storage | Reduced-availability Storage | Amazon S3 One Zone-Infrequent Access | Cloud Storage Nearline
Storage | Archival Storage | Amazon Glacier | Cloud Storage Coldline
Storage | File Storage | Amazon Elastic File System | Cloud Filestore (beta)
Database | RDBMS | Amazon Aurora | Cloud SQL, Cloud Spanner
Database | NoSQL: Key-value | Amazon DynamoDB | Cloud Firestore, Cloud Bigtable
Database | NoSQL: Indexed | Amazon SimpleDB | Cloud Firestore
Big Data & Analytics | Batch Data Processing | Amazon Elastic MapReduce, AWS Batch | Cloud Dataproc, Cloud Dataflow
Big Data & Analytics | Stream Data Processing | Amazon Kinesis | Cloud Dataflow
Big Data & Analytics | Stream Data Ingest | Amazon Kinesis | Cloud Pub/Sub
Big Data & Analytics | Analytics | Amazon Redshift, Amazon Athena | BigQuery
Big Data & Analytics | Workflow Orchestration | Amazon Data Pipeline, AWS Glue | Cloud Composer
Application Services | Messaging | Amazon Simple Notification Service | Cloud Pub/Sub
Management Services | Monitoring | Amazon CloudWatch | Stackdriver Monitoring
Management Services | Logging | Amazon CloudWatch Logs | Stackdriver Logging
Management Services | Deployment | AWS CloudFormation | Cloud Deployment Manager
Machine Learning | Speech | Amazon Transcribe | Cloud Speech-to-Text
Machine Learning | Vision | Amazon Rekognition | Cloud Vision
Machine Learning | Natural Language Processing | Amazon Comprehend | Cloud Natural Language
Machine Learning | Translation | Amazon Translate | Cloud Translation
Machine Learning | Conversational Interface | Amazon Lex | Dialogflow Enterprise Edition
Machine Learning | Video Intelligence | Amazon Rekognition Video | Cloud Video Intelligence
Machine Learning | Auto-generated Models | N/A | Cloud AutoML (beta)
Machine Learning | Fully Managed ML | Amazon SageMaker | Cloud Machine Learning Engine

Example of using Google BigQuery on GCP in 15 minutes
#===========================================================
#run the following in the BigQuery console SQL editor:
SELECT
  name, gender,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name, gender
ORDER BY
  total DESC
LIMIT
  10

-- Create a partitioned and clustered table with CREATE TABLE ... AS SELECT.
-- Note: myDate and cluster_col are placeholders; the PARTITION BY and CLUSTER BY
-- columns must exist in the query result.
CREATE TABLE `maximal-inkwell-#####.sample1.test`
PARTITION BY myDate
CLUSTER BY cluster_col
AS
SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10


{CREATE TABLE | CREATE TABLE IF NOT EXISTS | CREATE OR REPLACE TABLE}
table_name
[(
  column_name column_schema[, ...]
)]
[PARTITION BY partition_expression]
[CLUSTER BY clustering_column_list]
[OPTIONS(table_option_list)]
[AS query_statement]


## Here is the Python code:
from google.cloud import bigquery


def create_table_as_select(dataset_name, table_name, sqlQuery, project=None):
    try:
        # Create the client here so the function is self-contained.
        bigquery_client = bigquery.Client(project=project)
        job_config = bigquery.QueryJobConfig()

        # Set configuration.query.destinationTable
        dataset_ref = bigquery_client.dataset(dataset_name)
        table_ref = dataset_ref.table(table_name)
        job_config.destination = table_ref

        # Set configuration.query.createDisposition
        job_config.create_disposition = 'CREATE_IF_NEEDED'

        # Set configuration.query.writeDisposition
        job_config.write_disposition = 'WRITE_APPEND'

        # Start the query
        job = bigquery_client.query(sqlQuery, job_config=job_config)

        # Wait for the query to finish
        job.result()

        returnMsg = 'Created table {}.'.format(table_name)
        return returnMsg

    except Exception as e:
        errorStr = 'ERROR (create_table_as_select): ' + str(e)
        print(errorStr)
        raise
#===========================================================
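
For reference, here is a minimal usage sketch of the function above; the dataset name, destination table name, and project ID are placeholders of my own, and the SQL reuses the public usa_names query from earlier:
#===========================================================
# Usage sketch for create_table_as_select (dataset, table, and project are placeholders).
sql = """
SELECT name, gender, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name, gender
ORDER BY total DESC
LIMIT 10
"""
print(create_table_as_select('sample1', 'top_names', sql, project='my-project-id'))
#===========================================================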

We can also run BigQuery SQL from Cloud Shell:
#===========================================================
bq query "select string_field_10 as request, count(*) as requestcount 
from logdata.accesslog group by request order by requestcount desc"
#===========================================================

#===========================================================
# Create a dataset using the BigQuery API:
bigquery_client  = bigquery.Client() #Create a BigQuery service object
dataset_id = 'my_dataset' 
dataset_ref = bigquery_client.dataset(dataset_id) # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref)  # Construct a full Dataset object to send to the API.
dataset.location = 'US' # Specify the geographic location where the new dataset will reside.
# Remember: this should be the same location as the source dataset that the query reads from.

# Send the dataset to the API for creation. Raises google.api_core.exceptions.AlreadyExists if the Dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset)  # API request
print('Dataset {} created.'.format(dataset.dataset_id))

#===========================================================
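
Once the dataset exists, creating an empty table with an explicit schema follows the same pattern. A minimal sketch reusing bigquery_client and dataset_ref from the snippet above (the table name and schema are my own placeholders):
#===========================================================
# Minimal sketch: create an empty table in the new dataset
# (the table name and schema are placeholders).
from google.cloud import bigquery

schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('total', 'INTEGER'),
]
table_ref = dataset_ref.table('my_table')        # dataset_ref from the snippet above
table = bigquery.Table(table_ref, schema=schema)
table = bigquery_client.create_table(table)      # API request
print('Table {} created.'.format(table.table_id))
#===========================================================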
