Sunday, April 5, 2020

GCP Study Notes 4: GCP Big Data and Machine Learning Fundamentals (coursera notes)

Lab exploration steps (a hedged command-line sketch follows this list):
Create Cloud SQL instance
Create database tables by importing .sql files from Cloud Storage
Populate the tables by importing .csv files from Cloud Storage
Allow access to Cloud SQL
Explore the rentals data using SQL statements from CloudShell
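A minimal command-line sketch of these steps, assuming the Cloud SQL instance is named rentals (as in the connect command further down) and that the .sql/.csv files have already been staged in gs://$DEVSHELL_PROJECT_ID; the table_creation.sql file name and the tier/region settings are illustrative assumptions, not from the lab text:
#================================================
# hedged sketch only -- instance settings and file names are assumptions
gcloud sql instances create rentals \
    --tier=db-n1-standard-1 --region=us-central1

# create the database and tables from the staged .sql file
gcloud sql import sql rentals gs://$DEVSHELL_PROJECT_ID/table_creation.sql --quiet

# populate the tables from the staged .csv files
gcloud sql import csv rentals gs://$DEVSHELL_PROJECT_ID/accommodation.csv \
    --database=recommendation_spark --table=Accommodation --quiet
gcloud sql import csv rentals gs://$DEVSHELL_PROJECT_ID/rating.csv \
    --database=recommendation_spark --table=Rating --quiet
#===========================================================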

#================================================
CREATE DATABASE IF NOT EXISTS recommendation_spark;
USE recommendation_spark;

DROP TABLE IF EXISTS Recommendation;
DROP TABLE IF EXISTS Rating;
DROP TABLE IF EXISTS Accommodation;

CREATE TABLE IF NOT EXISTS Accommodation
(
id varchar(255),
title varchar(255),
location varchar(255),
price int,
rooms int,
rating float,
type varchar(255),
PRIMARY KEY (id)
);

CREATE TABLE  IF NOT EXISTS Rating
(
userId varchar(255),
accoId varchar(255),
rating int,
PRIMARY KEY(accoId, userId),
FOREIGN KEY (accoId)
REFERENCES Accommodation(id)
);

CREATE TABLE  IF NOT EXISTS Recommendation
(
userId varchar(255),
accoId varchar(255),
prediction float,
PRIMARY KEY(userId, accoId),
FOREIGN KEY (accoId)
REFERENCES Accommodation(id)
);

SHOW DATABASES;

root pw: r*****s0987
--command to connect to the Cloud SQL database from Cloud Shell:
gcloud sql connect rentals --user=root --quiet


--commands to create a bucket and copy the data files:
echo "Creating bucket: gs://$DEVSHELL_PROJECT_ID"
gsutil mb gs://$DEVSHELL_PROJECT_ID

echo "Copying data to our storage from public dataset"
gsutil cp gs://cloud-training/bdml/v2.0/data/accommodation.csv gs://$DEVSHELL_PROJECT_ID
gsutil cp gs://cloud-training/bdml/v2.0/data/rating.csv gs://$DEVSHELL_PROJECT_ID

echo "Show the files in our bucket"
gsutil ls gs://$DEVSHELL_PROJECT_ID

echo "View some sample data"
gsutil cat gs://$DEVSHELL_PROJECT_ID/accommodation.csv
#===========================================================

Lab 1: Use Dataproc to train the recommendations machine learning model based on users' previous ratings, then apply that model to create a list of recommendations for every user in the database. In this lab, you will:
Launch Dataproc (a cluster-creation sketch follows this list)
Train and apply an ML model written in PySpark to create product recommendations
Explore the inserted rows in Cloud SQL
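Before running the authorization script below, the cluster itself has to exist. A hedged sketch of the "Launch Dataproc" step, using the cluster name, zone, and worker count assumed by that script:
#================================================
# hedged sketch: create the Dataproc cluster referenced by the script below
gcloud dataproc clusters create rentals \
    --region=us-central1 \
    --zone=us-central1-a \
    --num-workers=2
#===========================================================
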
#================================================
echo "Authorizing Cloud Dataproc to connect with Cloud SQL"
CLUSTER=rentals
CLOUDSQL=rentals
ZONE=us-central1-a
NWORKERS=2

machines="$CLUSTER-m"
for w in `seq 0 $(($NWORKERS - 1))`; do
   machines="$machines $CLUSTER-w-$w"
done

echo "Machines to authorize: $machines in $ZONE ... finding their IP addresses"
ips=""
for machine in $machines; do
    IP_ADDRESS=$(gcloud compute instances describe $machine --zone=$ZONE --format='value(networkInterfaces.accessConfigs[].natIP)' | sed "s/\[u'//g" | sed "s/'\]//g" )/32
    echo "IP address of $machine is $IP_ADDRESS"
    if [ -z "$ips" ]; then
       ips=$IP_ADDRESS
    else
       ips="$ips,$IP_ADDRESS"
    fi
done

echo "Authorizing [$ips] to access cloudsql=$CLOUDSQL"
gcloud sql instances patch $CLOUDSQL --authorized-networks $ips
#===========================================================

Your data science team has created a recommendation model using Apache Spark, written in Python (PySpark). Let's copy it into Cloud Shell so we can edit its connection settings before submitting it to Dataproc.
#================================================
gsutil cp gs://cloud-training/bdml/v2.0/model/train_and_apply.py train_and_apply.py
cloudshell edit train_and_apply.py

#!/usr/bin/env python

import os
import sys
import pickle
import itertools
from math import sqrt
from operator import add
from os.path import join, isfile, dirname
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# MAKE EDITS HERE
CLOUDSQL_INSTANCE_IP = '34.68.30.28'   # <---- CHANGE (database server IP)
CLOUDSQL_DB_NAME = 'recommendation_spark' # <--- leave as-is
CLOUDSQL_USER = 'root'  # <--- leave as-is
CLOUDSQL_PWD  = 'r****s0987'  # <---- CHANGE

# DO NOT MAKE EDITS BELOW
conf = SparkConf().setAppName("train_model")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl    = 'jdbc:mysql://%s:3306/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)

# checkpointing helps prevent stack overflow errors
sc.setCheckpointDir('checkpoint/')

# Read the ratings and accommodations data from Cloud SQL
dfRates = sqlContext.read.format('jdbc').options(driver=jdbcDriver, url=jdbcUrl, dbtable='Rating', useSSL='false').load()
dfAccos = sqlContext.read.format('jdbc').options(driver=jdbcDriver, url=jdbcUrl, dbtable='Accommodation', useSSL='false').load()
print("read ...")

# train the model
model = ALS.train(dfRates.rdd, 20, 20) # you could tune these numbers, but these are reasonable choices
print("trained ...")

# use this model to predict what the user would rate accommodations that she has not rated
allPredictions = None
for USER_ID in range(0, 100):
  dfUserRatings = dfRates.filter(dfRates.userId == USER_ID).rdd.map(lambda r: r.accoId).collect()
  rddPotential  = dfAccos.rdd.filter(lambda x: x[0] not in dfUserRatings)
  pairsPotential = rddPotential.map(lambda x: (USER_ID, x[0]))
  predictions = model.predictAll(pairsPotential).map(lambda p: (str(p[0]), str(p[1]), float(p[2])))
  predictions = predictions.takeOrdered(5, key=lambda x: -x[2]) # top 5
  print("predicted for user={0}".format(USER_ID))
  if allPredictions is None:
    allPredictions = predictions
  else:
    allPredictions.extend(predictions)

# write them
schema = StructType([StructField("userId", StringType(), True), StructField("accoId", StringType(), True), StructField("prediction", FloatType(), True)])
dfToSave = sqlContext.createDataFrame(allPredictions, schema)
dfToSave.write.jdbc(url=jdbcUrl, table='Recommendation', mode='overwrite')
#===========================================================
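Once the connection settings are edited, the script is submitted to the Dataproc cluster. A hedged sketch of that submission, reusing the rentals cluster from above; the --jars path for the MySQL JDBC connector is an illustrative placeholder, not a real location:
#================================================
# hedged sketch: submit the edited PySpark job to the "rentals" cluster
gcloud dataproc jobs submit pyspark ./train_and_apply.py \
    --cluster=rentals \
    --region=us-central1 \
    --jars=gs://$DEVSHELL_PROJECT_ID/mysql-connector-java.jar
#===========================================================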
Lab 2: Predict Visitor Purchases with a Classification Model with BigQuery ML
In this lab, you learn to perform the following tasks:
Use BigQuery to find public datasets
Query and explore the ecommerce dataset (see the bq sketch after this list)
Create a training and evaluation dataset to be used for batch prediction
Create a classification (logistic regression) model in BQML
Evaluate the performance of your machine learning model
Predict and rank the probability that a visitor will make a purchase
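For the "Query and explore the ecommerce dataset" step, the lab works in the BigQuery console UI; a hedged equivalent from Cloud Shell with the bq tool (the aggregate query itself is just an illustration):
#================================================
# hedged sketch: explore the public ecommerce dataset from Cloud Shell
bq query --use_legacy_sql=false '
SELECT
  COUNT(DISTINCT fullVisitorId) AS unique_visitors,
  COUNTIF(totals.transactions > 0) AS sessions_with_purchase
FROM `data-to-insights.ecommerce.web_analytics`'
#===========================================================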

#================================================
#1. Train a classification model on the training dataset:
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS

#standardSQL
SELECT
  * EXCEPT(fullVisitorId)
FROM
  # features
  (SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM
    `data-to-insights.ecommerce.web_analytics`
  WHERE
    totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
  JOIN
  (SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
  FROM
      `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullvisitorid)
  USING (fullVisitorId)
;
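
# (Optional, hedged addition -- not one of the lab's numbered steps.) After the
# model finishes training, you can inspect the per-iteration training loss with
# ML.TRAINING_INFO:
SELECT
  iteration,
  loss,
  eval_loss,
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL `ecommerce.classification_model`);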

#2. Evaluate the model performance (ROC AUC, etc.):
SELECT
  roc_auc,
  CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'not great'
  ELSE 'poor' END AS model_quality
FROM
  ML.EVALUATE(MODEL ecommerce.classification_model,  (

SELECT
  * EXCEPT(fullVisitorId)
FROM

  # features
  (SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM
    `data-to-insights.ecommerce.web_analytics`
  WHERE
    totals.newVisits = 1
    AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
  JOIN
  (SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
  FROM
      `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullvisitorid)
  USING (fullVisitorId)

)); 
#3. Predict and rank the probability that a new visitor will buy on a return visit
# (this uses classification_model_2, an improved model with more features that the
# lab trains in a later step):
SELECT
*
FROM
  ml.PREDICT(MODEL `ecommerce.classification_model_2`,
   (

WITH all_visitor_stats AS (
SELECT
  fullvisitorid,
  IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullvisitorid
)

  SELECT
      CONCAT(fullvisitorid, '-', CAST(visitId AS STRING)) AS unique_session_id,

      # labels
      will_buy_on_return_visit,

      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

      # behavior on the site
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site,
      totals.pageviews,

      # where the visitor came from
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,

      # mobile or desktop
      device.deviceCategory,

      # geographic
      IFNULL(geoNetwork.country, "") AS country

  FROM `data-to-insights.ecommerce.web_analytics`,
      UNNEST(hits) AS h

  JOIN all_visitor_stats USING(fullvisitorid)

  WHERE
    # only predict for new visits
    totals.newVisits = 1
    AND date BETWEEN '20170701' AND '20170801' # test 1 month

  GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
  )
)
ORDER BY
  predicted_will_buy_on_return_visit DESC;
#===========================================================


Create a dataset in GCP BigQuery via the Console UI:
Create a dataset (database) first; then, within each dataset, you create separate tables to hold your data. For example, a streaming taxi-rides table could use the schema below (a bq command-line sketch follows the schema block):


#================================================
ride_id:string,
point_idx:integer,
latitude:float,
longitude:float,
timestamp:timestamp,
meter_reading:float,
meter_increment:float,
ride_status:string,
passenger_count:integer
#===========================================================
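A hedged command-line equivalent of the Console steps above, using the schema just listed; the dataset and table names (taxirides, realtime) are illustrative assumptions:
#================================================
# hedged sketch: create a dataset and a table with the schema listed above
bq mk taxirides
bq mk --table taxirides.realtime \
  ride_id:string,point_idx:integer,latitude:float,longitude:float,timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,passenger_count:integer
#===========================================================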

Classify Images with Pre-built ML Models using Cloud Vision API and AutoML
In this lab you will upload images to Cloud Storage and use them to train a custom model to recognize different types of clouds (cumulus, cumulonimbus, etc.)
1. Upload a labeled dataset to Google Cloud Storage and connect it to AutoML Vision with a CSV label file.
2. Train a model with AutoML Vision and evaluate its accuracy.
3. Generate predictions on your trained model.

First step: after enabling the Cloud AutoML Vision API, set the project ID and username from Cloud Shell:
export PROJECT_ID=qwiklabs-gcp-02-1258a4b7a7ca
export QWIKLABS_USERNAME=student-02-ca7927dfff9d@qwiklabs.net

Now create a Cloud Storage bucket for the images you will use by running the following command:

gsutil mb -p $PROJECT_ID \
-c regional \
-l us-central1 \
gs://$PROJECT_ID-vcm/


Before you add the cloud images, create an environment variable with the name of your bucket by running the following command in Cloud Shell:
export BUCKET=$PROJECT_ID-vcm

Note that you sometimes need a trailing "/" at the end; you can double-check that you have assigned the value correctly by running echo $BUCKET on the Cloud Shell command line.

The training images are publicly available in a Cloud Storage bucket. Use the gsutil command line utility for Cloud Storage to copy the training images into your bucket:
gsutil -m cp -r gs://automl-codelab-clouds/* gs://${BUCKET}/

When the images finish copying, click the Refresh button at the top of the Cloud Storage browser, then click on your bucket name. You should see 3 folders of photos, one for each of the 3 cloud types to be classified.

Now that your training data is in Cloud Storage, you need a way for AutoML Vision to access it. You'll create a CSV file where each row contains a URL to a training image and the associated label for that image. This CSV file has been created for you; you just need to update it with your bucket name.

Run the following command to copy the file to your Cloud Shell instance:

gsutil cp gs://automl-codelab-metadata/data.csv .

Then update the CSV with the files in your project:

sed -i -e "s/placeholder/${BUCKET}/g" ./data.csv
--this replaces the string "placeholder" with your bucket name

Replace text at line 3:
sed -e "3s/oldtext/newtext/" infile > outfile

Replace all occurrences:
sed -e "s/oldtext/newtext/g" infile > outfile

Change aaa to bbb in a shell variable:
outfile=$(echo "$infile" | sed -e "s/aaa/bbb/")

Print line i:
sed -n "$i"p infile

Now you're ready to upload this file to your Cloud Storage bucket:

gsutil cp ./data.csv gs://${BUCKET}


Step 2: Create the training dataset in the AutoML Vision UI:
At the top of the console, click + NEW DATASET.
Type "clouds" for the Dataset name.
Leave "Single-label Classification" checked.

The CSV training file should contain rows like the following:
[set,]image_path[,label]
TRAIN,gs://My_Bucket/sample1.jpg,cat
TEST,gs://My_Bucket/sample2.jpg,dog

After the training finishes, we can get predictions by clicking "Test & Use" in the UI, or by running this in Cloud Shell:
python predict.py YOUR_LOCAL_IMAGE_FILE 833728172698 ICN1336966556957016064

or via the Python client library:
#===========================================================
import sys

from google.cloud import automl_v1beta1
from google.cloud.automl_v1beta1.proto import service_pb2


# 'content' is the raw image bytes read from the local file.
def get_prediction(content, project_id, model_id):
  prediction_client = automl_v1beta1.PredictionServiceClient()

  name = 'projects/{}/locations/us-central1/models/{}'.format(project_id, model_id)
  payload = {'image': {'image_bytes': content}}
  params = {}
  request = prediction_client.predict(name, payload, params)
  return request  # waits until the request returns

if __name__ == '__main__':
  file_path = sys.argv[1]
  project_id = sys.argv[2]
  model_id = sys.argv[3]

  with open(file_path, 'rb') as ff:
    content = ff.read()

  print(get_prediction(content, project_id, model_id))
#===========================================================
In summary, there are 3 ways to do image classification on GCP:
1. Use BigQuery ML to run the classification training and prediction in SQL.
2. Use the AutoML Vision API: load the image data plus a CSV file of image URLs (in Cloud Storage), then explore/train/evaluate/predict.
3. Use a Deep Learning VM / AI Platform / ML Engine Jupyter notebook (where you can write your own model with Keras) to write Python code to explore/train/evaluate/predict.

