The most easy way to run jupyter notebook in GCP
There are several ways to run jupyter notebook in GCP, you can starting from creating dataproc cluster, or VM, then install the anaconda package there. It involves quite a few configue setting and might not that easy for the traditional data scientist with more statitics background. Here is the most easy way I found you can run the jupyter session in GCP via a fewest clicks.
Go to left menu of GCP, find "AI platform", then click "notebook", you can easily create a notebook with a few clicks. After you create the notebook, click "open jupyterlab", then you are in the traditional jupyter environment, you can create .txt file and rename to .py file, running like python spyder.
Another powerful part is the tutorial there, once you are in the jupyter environment, you can see a "tutorial" folder, which has a lot of tutorial examples you can try there.
Error: "cannot import name automl_v1beta1", "cannot import name automlwrapper","ImportError: No module named 'automlwrapper'"
When you are in the jupyter emvironment, you were trying to import automlwrapper or automl_v1beta1, you might find the above error message. Here is some quick way to fix:
#===========================================================
#run the following in the terminal
pip install --user --upgrade google-cloud-automl
#then re-start the kernel in jupyter, run import again:
from google.cloud import storage
from google.cloud import automl_v1beta1 as automl
from automlwrapper import AutoMLWrapper
#===========================================================
Here are some example code of creating dataproc cluster and pre-install python.
#===========================================================
#!/bin/bash
set -euo pipefail
DIR="${BASH_SOURCE%/*}"
[[ ! -d "${DIR}" ]] && DIR="${PWD}"
readonly DIR
echo "Creating dataproc cluster..."
function usage {
cat << EOF
usage: $0 [-h] [-c=cluster-name]
-h display help
-c=cluster-name specify unique dataproc cluster name to launch
EOF
exit 1
}
for i in "$@"
do
case $i in
-c=*)
DATAPROC_CLUSTER_NAME="${i#*=}"
shift # past argument=value
;;
-h)
usage
;;
*)
;;
esac
done
[[ -c "${DATAPROC_CLUSTER_NAME}" ]] && usage
echo "Using following cluster name: ${DATAPROC_CLUSTER_NAME}"
PROJECT_ID="****-edg-***-poc-****"
PROJECT_NAME="****-edg-***-poc"
REGION="global"
SERVICE_ACCOUNT="dataproc-cl-admin-sa@$PROJECT_ID.iam.gserviceaccount.com"
# NOTE: change the following parameters to suit your needs
# master-boot-disk-size
# worker-boot-disk-size
# num-workers
# num-preemptible-workers (if needed) ADD BACKSLASH TO PREVIOUS UNCOMMENTED LINE IF YOU USE THIS
# max-idle (if needed) ADD BACKSLASH TO PREVIOUS UNCOMMENTED LINE IF YOU USE THIS
gcloud beta dataproc clusters create ${DATAPROC_CLUSTER_NAME} \
--image-version 1.4 \
--zone=us-west1-a \
--bucket=edg-dsa-users \
--enable-component-gateway \
--metadata 'MINICONDA_VARIANT=3' \
--metadata 'MINICONDA_VERSION=latest' \
--metadata 'CONDA_PACKAGES=jupyterlab' \
--metadata 'PIP_PACKAGES=pandas patsy datetime pyhive pandas_gbq argparse sklearn gcsfs pandasql keras tensorflow seaborn pandas_profiling google-cloud-storage paramiko google-cloud-automl' \
--metadata 'PYTHON_VERSION=3.7' \
--subnet "projects/****-network-sbx-****/regions/us-east1/subnetworks/****-data-svcs-1-us-w1-sbx-subnet" \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--tags allow-ssh,egress-nat-gce \
--project $PROJECT_ID \
--service-account $SERVICE_ACCOUNT \
--num-workers=2 \
--num-preemptible-workers=0 \
--master-machine-type=n1-highmem-96 \
--worker-machine-type=n1-standard-1 \
--master-boot-disk-size=1000GB \
--worker-boot-disk-size=15GB \
--properties "yarn:yarn.scheduler.maximum-allocation-mb=241664,yarn:yarn.nodemanager.resource.memory-mb=241664" \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh,gs://edg-dsa-scripts/dataproc/jupyter.sh,gs://dataproc-initialization-actions/hue/hue.sh,gs://dataproc-initialization-actions/python/pip-install.sh
#===========================================================
What's the difference between Kubernetes and Docker?
There are two kinds of usage: committed and sustained discount usage.
vCPU = unit of measure of CPU.
RAM is measure in unit of Gigabite.
There are 3 different ways to do dev on GCP: command line(gcloud), REST API, Google Cloud Consule.
HDD: harddisk device.
SSD: Solid State Device. SSD is more expensive than HDD, physically more closely to the VM.
2 kinds of VM: standard and preemptible: preempitble virtual machine(PVM) are more cheap(up to 80% cheaper than regular instances), technically the same as standard, but charged on the excess usage, is always terminated after 24 hours and when the GCP usage spikes, also terminates, whichever comes first. This is the reason they might be stopped at any time.
#===============================================================
// ENABLE PREEMPTIBLE OPTION
gcloud compute instances create my-vm --zone us-central1-b --preemptible
#===============================================================
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes was built by Google based on their experience running containers in production using an internal cluster management system called Borg.
In GCP: Compute engine ~ Virtual Machine = AWS (EC2)
Here is some example using papermill to run a collection of jupyter notebooks automatically, output to GCS, and then shut down to save the cost.
The following command starts execution of a Jupyter notebook stored in a Cloud Storage bucket:
#===============================================================
# Compute Engine Instance parameters
export IMAGE_FAMILY="tf-latest-cu100"
export ZONE="us-central1-b"
export INSTANCE_NAME="notebook-executor"
export INSTANCE_TYPE="n1-standard-8"
# Notebook parameters
export INPUT_NOTEBOOK_PATH="gs://my-bucket/input.ipynb"
export OUTPUT_NOTEBOOK_PATH="gs://my-bucket/output.ipynb"
export PARAMETERS_FILE="params.yaml" # Optional
export PARAMETERS="-p batch_size 128 -p epochs 40" # Optional
export STARTUP_SCRIPT="papermill ${INPUT_NOTEBOOK_PATH} ${OUTPUT_NOTEBOOK_PATH} -y ${PARAMETERS_FILE} ${PARAMETERS}"
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator='type=nvidia-tesla-t4,count=2' \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=100GB \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata="install-nvidia-driver=True,startup-script=${STARTUP_SCRIPT}"
gcloud --quiet compute instances delete $INSTANCE_NAME --zone $ZONE
#===============================================================
The above commands do the following:
Create a Compute Engine instance using TensorFlow Deep Learning VM and 2 NVIDIA Tesla T4 GPUs
Install the latest NVIDIA GPU drivers
Execute the notebook using Papermill
Upload notebook result (with all the cells pre-computed) to Cloud Storage bucket in this case: “gs://my-bucket/”
Terminate the Compute Engine instance
After clicking activate google cloud Shell, in the terminal, you can view your roles by running:
gcloud iam roles describe roles/viewer
gcloud iam roles describe roles/editor
gcloud iam roles describe roles/owner
Google Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way.
Cloud Dataproc easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics and machine learning
Apache Spark is an analytics engine for large scale data processing. Logistic regression is available as a module in Apache Spark's machine learning library, MLlib.
Spark MLlib, also called Spark ML, includes implementations for most standard machine learning algorithms such as k-means clustering, random forests,
alternating least squares, k-means clustering, decision trees, support vector machines, etc. Spark can run on a Hadoop cluster, like Google Cloud Dataproc, in order to process very large datasets in parallel.
Relational databases are also called Relational Database Management Systems (RDBMS) or SQL databases. Historically,
the most popular of these have been Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2.
Relational databases are designed to run on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing.
Non-relational databases are also called NoSQL databases. NoSQL has become an industry standard term,
but the name is beginning to lose popularity since it doesn’t fully cover the complexity and range of non-relational data stores that are available.
Some of the most known NoSQL or non-relational DBs that Serra discussed are MongoDB, DocumentDB, Cassandra, Coachbase, HBase, Redis, and Neo4j.
Hadoop is a file system with components made up of Hadoop Distributed File System (HDFS), Yarn, and MapReduce.
Cloud Spanner is the first scalable, enterprise-grade, globally-distributed, and strongly consistent database service built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.
Difference between horizontally scaling and vertically scaling:
In simple layman language, Horizontal scaling is more complex than the vertical scaling(adding more cpu,memory at the existing machine).
Horizontal scaling means that you scale by adding more machines into your pool of resources
whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.
Horizontal Scaling, also referred to as "scale-out" is basically the addition of more machines or setting up a cluster or a distributed environment for your software system.
This usually requires a load-balancer program which is a middle-ware component in the standard 3 tier client-server architectural model.
Vertical Scaling - also referred to as "scale-up" approach is an attempt to increase the capacity of a single machine :
By adding more processing power By adding more storage More memory etc.
Horizontal scaling comes with overhead in the form of cluster setup, management, and maintenance costs and complexities.
The design gets increasingly complex and programming model changes.
Relational Databases:
Pros: Relational databases work with structured data.
They support ACID transactional consistency and support “joins.”
They come with built-in data integrity and a large eco-system.
Relationships in this system have constraints.
There is limitless indexing. Strong SQL.
Cons:
Relational Databases do not scale out horizontally very well (concurrency and data size), only vertically, (unless you use sharding).
Data is normalized, meaning lots of joins, which affects speed.
They have problems working with semi-structured data.
Non-relational/NoSQL
Pros:
They scale out horizontally and work with unstructured and semi-structured data. Some support ACID transactional consistency.
Schema-free or Schema-on-read options.
High availability.
While many NoSQL databases are open source and so “free”, there are often considerable training, setup, and developments costs. There are now also numerous commercial products available.
Cons:
Weaker or eventual consistency (BASE) instead of ACID.
Limited support for joins.
Data is denormalized, requiring mass updates (i.e. product name change).
Does not have built-in data integrity (must do in code).
Limited indexing.
What is SSH?
Secure Shell, sometimes referred to as Secure Socket Shell, is a protocol which allows you to connect securely to a remote computer or a server by using a text-based interface.
When a secure SSH connection is established, a shell session will be started, and you will be able to manipulate the server by typing commands within the client on your local computer.
Question:
ImportError: No module named google.cloud
Answer:
1. pip install --upgrade google-cloud-storage
2. restart the kerkel.
if you run the following: pip install google-cloud, it might not work.
check GCP creditial:
import os print('Credendtials from environ: {}'.format(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')))
Question:
What's the product matching between AWS and GCP?
Answer:
Service Category | Service | AWS | Google Cloud Platform |
---|---|---|---|
Compute | IaaS | Amazon Elastic Compute Cloud(EC2) | Compute Engine |
PaaS | AWS Elastic Beanstalk | App Engine | |
Containers | Amazon Elastic Container Service | Google Kubernetes Engine | |
Serverless Functions | AWS Lambda | Cloud Functions | |
Managed Batch Computing | AWS Batch | N/A | |
Network | Virtual Networks | Amazon Virtual Private Cloud | Virtual Private Cloud |
Load Balancer | Elastic Load Balancer | Cloud Load Balancing | |
Dedicated Interconnect | Direct Connect | Cloud Interconnect | |
Domains and DNS | Amazon Route 53 | Google Domains, Cloud DNS | |
CDN | Amazon CloudFront | Cloud CDN | |
Storage | Object Storage | Amazon Simple Storage Service(S3) | Cloud Storage |
Block Storage | Amazon Elastic Block Store | Persistent Disk | |
Reduced-availability Storage | Amazon S3 One Zone-Infrequent Access | Cloud Storage Nearline | |
Archival Storage | Amazon Glacier | Cloud Storage Coldline | |
File Storage | Amazon Elastic File System | Cloud Filestore (beta) | |
Database | RDBMS | Amazon Aurora | Cloud SQL, Cloud Spanner |
NoSQL: Key-value | Amazon DynamoDB | Cloud Firestore, Cloud Bigtable | |
NoSQL: Indexed | Amazon SimpleDB | Cloud Firestore | |
Big Data & Analytics | Batch Data Processing | Amazon Elastic MapReduce, AWS Batch | Cloud Dataproc, Cloud Dataflow |
Stream Data Processing | Amazon Kinesis | Cloud Dataflow | |
Stream Data Ingest | Amazon Kinesis | Cloud Pub/Sub | |
Analytics | Amazon Redshift, Amazon Athena | BigQuery | |
Workflow Orchestration | Amazon Data Pipeline, AWS Glue | Cloud Composer | |
Application Services | Messaging | Amazon Simple Notification Service | Cloud Pub/Sub |
Management Services | Monitoring | Amazon CloudWatch | Stackdriver Monitoring |
Logging | Amazon CloudWatch Logs | Stackdriver Logging | |
Deployment | AWS CloudFormation | Cloud Deployment Manager | |
Machine Learning | Speech | Amazon Transcribe | Cloud Speech-to-Text |
Vision | Amazon Rekognition | Cloud Vision | |
Natural Language Processing | Amazon Comprehend | Cloud Natural Language | |
Translation | Amazon Translate | Cloud Translation | |
Conversational Interface | Amazon Lex | Dialogflow Enterprise Edition | |
Video Intelligence | Amazon Rekognition Video | Cloud Video Intelligence | |
Auto-generated Models | N/A | Cloud AutoML (beta) | |
Fully Managed ML | Amazon SageMaker | Cloud Machine Learning Engine |
Example of using google bigquery on GCP in 15 min
#===========================================================
#running in the bigquery API SQL:
create table `maximal-inkwell-#####.sample1.test`
PARTITION BY myDate
CLUSTER BY cluster_col
as
SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10
#here is the syntax:
{CREATE TABLE | CREATE TABLE IF NOT EXISTS | CREATE OR REPLACE TABLE}
table_name
[(
column_name column_schema[, ...]
)]
[PARTITION BY partition_expression]
[CLUSTER BY clustering_column_list]
[OPTIONS(table_option_list)]
[AS query_statement]
######################################################################
##Here are the python code:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
print("Query results loaded to the table {}".format(table_id))
######################################################################
#you can also define a function to create table in bigquery via sql
import time
from google.cloud import bigquery
from google.cloud.bigquery.table import Table
from google.cloud.bigquery.dataset import Dataset
def create_table_as_select(dataset_name, table_name, sqlQuery, project=None):
try:
job_config = bigquery.QueryJobConfig()
# Set configuration.query.destinationTable
dataset_ref = bigquery_client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config.destination = table_ref
# Set configuration.query.createDisposition
job_config.create_disposition = 'CREATE_IF_NEEDED'
# Set configuration.query.writeDisposition
job_config.write_disposition = 'WRITE_APPEND'
# Start the query
job = bigquery_client.query(sqlQuery, job_config=job_config)
# Wait for the query to finish
job.result()
returnMsg = 'Created table {} .'.format(table_name)
return returnMsg
except Exception as e:
errorStr = 'ERROR (create_table_as_select): ' + str(e)
print(errorStr)
raise
#===========================================================
We can also run bigquery sql from the cloud shell:
#===========================================================
bq query "select string_field_10 as request, count(*) as requestcount
from logdata.accesslog group by request order by requestcount desc"
#===========================================================
#===========================================================# Create a dataset using bigquery API:
bigquery_client = bigquery.Client() #Create a BigQuery service object
dataset_id = 'my_dataset'
dataset_ref = bigquery_client.dataset(dataset_id) # Create a DatasetReference using a chosen dataset ID.
dataset = bigquery.Dataset(dataset_ref) # Construct a full Dataset object to send to the API.
dataset.location = 'US' # Specify the geographic location where the new dataset will reside.
Remember this should be same location as that of source data set from where we are getting data to run a query
# Send the dataset to the API for creation. Raises google.api_core.exceptions.AlreadyExists if the Dataset already exists within the project.
dataset = bigquery_client.create_dataset(dataset) # API request
print('Dataset {} created.'.format(dataset.dataset_id))
#===========================================================
No comments:
Post a Comment