Saturday, December 7, 2019
GCP Study notes 1: history of GCP, and why I think it's better than AWS
Study notes from the 1st class from Coursera: Google Cloud Platform Big Data and Machine Learning Fundamentals
Summary: Google's own tools: MapReduce, Bigtable, Dremel.
Corresponding GCP products: BigQuery, Cloud Datastore.
In 2002, Google created GFS, the Google File System, to handle sharding and storing petabytes of data at scale. GFS is the foundation for Cloud Storage and also for what would become BigQuery managed storage.
One of Google's next challenges was to figure out how to index the exploding volume of content on the web. To solve this in 2004, Google invented a new style of data processing known as MapReduce to manage large-scale data processing across large clusters of commodity servers. MapReduce programs are automatically parallelized and executed on a large cluster of these commodity machines.
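To make the map/shuffle/reduce phases concrete, here is a toy single-process word-count sketch in Python. In a real MapReduce cluster, the framework runs many mappers and reducers in parallel on separate machines; this just shows the shape of the programming model.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data on big clusters", "big query"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result["big"])  # -> 3
```

The key idea is that map and reduce are pure per-record and per-key functions, so the framework can parallelize them automatically across commodity machines.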
Apache Hadoop's MapReduce and HDFS components were inspired by Google's papers on MapReduce and the Google File System (GFS). Doug Cutting and Mike Cafarella created Apache Hadoop a year after the Google white paper was published. Hadoop has moved far beyond its beginnings in web indexing, and is now used in many industries.
The core of Apache Hadoop consists of:
1. a storage part, known as Hadoop Distributed File System (HDFS).
2. a processing part Hadoop MapReduce, which is an implementation of the MapReduce programming model for large-scale data processing.
3. Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
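The fault tolerance comes from splitting files into fixed-size blocks and replicating each block on several nodes. Here is a much-simplified Python sketch of that idea (real HDFS uses 128 MB blocks and rack-aware placement; the 4-byte blocks and round-robin placement below are just for illustration):

```python
def split_into_blocks(data: bytes, block_size: int = 4):
    """Split a byte string into fixed-size blocks, HDFS-style.
    (Real HDFS defaults to 128 MB blocks; 4 bytes here for illustration.)"""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.
    Real HDFS placement is rack-aware; this is a simplification."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(len(blocks))  # -> 3 blocks: b"abcd", b"efgh", b"ij"
print(place_replicas(blocks, ["n1", "n2", "n3", "n4"]))
```

Because every block lives on multiple low-cost machines, losing any single machine loses no data, which is what lets HDFS run reliably on commodity hardware.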
HBase, MongoDB, Bigtable: NoSQL databases
As Google's needs grew, it faced the problem of recording and retrieving millions of streaming user actions with high throughput, in real time, which traditional SQL databases can't handle easily. Google then invented Bigtable, which was the inspiration for HBase: HBase is an open-source, non-relational distributed database modeled after Google's Bigtable. MongoDB is another NoSQL database program; it uses JSON-like documents with flexible schemas.
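To contrast the two NoSQL styles mentioned above, here is a pure-Python sketch (plain dicts, no client libraries) of a Bigtable-style wide-column model versus a MongoDB-style document model. The row keys and field names are made up for illustration:

```python
# Bigtable-style wide-column model: a sorted map of
# row_key -> "column_family:qualifier" -> value.
# Row keys encode entity + timestamp, so related rows sort together.
bigtable_like = {
    "user#1001#2019-12-07T10:00": {"actions:click": "checkout_button"},
    "user#1001#2019-12-07T10:01": {"actions:view": "order_page"},
    "user#1002#2019-12-07T10:02": {"actions:click": "home_link"},
}

# MongoDB-style document model: JSON-like, schema-flexible documents;
# each document can carry different fields.
mongo_like = [
    {"user_id": 1001, "action": "click", "target": "checkout_button"},
    {"user_id": 1001, "action": "view", "extra": {"page": "order_page"}},
]

# A range scan by row-key prefix is the core Bigtable access pattern:
prefix = "user#1001#"
events = [k for k in sorted(bigtable_like) if k.startswith(prefix)]
print(len(events))  # -> 2
```

The wide-column design makes time-ordered scans over one user's actions cheap, which is exactly the high-throughput streaming-actions workload described above.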
Bigtable (Google's own database) ==> Cloud Datastore (GCP product)
Google's reasons for developing its own database, Bigtable, included scalability and better control of performance characteristics. Its development began in 2004, and it is now used by a number of Google applications, such as web indexing, Gmail, Google Maps, and YouTube. The corresponding GCP product, Cloud Datastore, is built on Bigtable and is a highly scalable, fully managed NoSQL database service.
Between 2008 and 2010, Google started to move away from MapReduce for processing and querying large data sets, and instead moved towards new tools such as Dremel. Dremel is the query engine used in Google's BigQuery service; it is a distributed system developed at Google for interactively querying large datasets.
In other words, Google first invented MapReduce around 2004; then Apache Hadoop MapReduce and HDFS were created outside Google; then Google gave up MapReduce and invented Dremel in 2008. Google beat themselves first to beat other competitors!
Dremel took a new approach to big data processing where Dremel breaks data into small chunks called shards, and compresses them into a columnar format across distributed storage. It then uses a query optimizer to farm out tasks between the many shards of data and the Google data centers full of commodity hardware to process a query in parallel and deliver the results.
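The columnar idea behind Dremel can be sketched in a few lines of plain Python. The point is that once data is stored column by column, a query that aggregates one field only touches that field's shard, never the others (and same-typed columns also compress far better):

```python
# Row-oriented records, as a transactional database might store them.
rows = [
    {"user": "a", "country": "US", "spend": 10},
    {"user": "b", "country": "DE", "spend": 25},
    {"user": "c", "country": "US", "spend": 5},
]

# Columnar layout: each column stored (and compressed) separately,
# loosely analogous to Dremel's shards in distributed storage.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A query like SELECT SUM(spend) only reads the `spend` column,
# skipping `user` and `country` entirely.
total_spend = sum(columns["spend"])
print(total_spend)  # -> 40
```

In the real system the column shards are spread across many machines, and the query optimizer farms out partial aggregations to each shard before combining the results, which is what makes interactive queries over huge datasets possible.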
Google continued to innovate to solve its big data and ML challenges, and created Colossus as a next-generation distributed data store, Spanner as a planet-scale relational database, Flume and MillWheel for data pipelines, Pub/Sub for messaging, and TensorFlow for machine learning, plus the specialized TPU hardware we saw earlier and AutoML, which comes later in the course.
This is the amazing part I found out: Google AI published white papers by year as follows: 2015 (383), 2016 (491), 2017 (587), 2018 (716), 2019 (754). At the same time, I checked AWS's published papers; they are more about business deployment. Personally, I don't think AWS has as much of its own technology base compared with Google. AWS did a very good job migrating traditional business tools onto cloud services, which makes them the (temporary) market leader; however, in the long run, I believe Google has a better chance of winning the competition.
In summary, the advantages of GCP over AWS:
1. Businesses in the retail sector and other verticals that directly compete with Amazon are moving away from AWS because they do not wish to “feed the beast” by contributing to a competitor’s bottom line.
2. Kubernetes & DevOps: AWS offers Kubernetes services, but Google developed Kubernetes. Google Cloud Platform users get to access new Kubernetes features and deployments immediately, while rollouts on AWS are delayed. Google Kubernetes Engine (GKE), generally considered the gold standard for running Kubernetes, is easier to use than Amazon EKS, especially for developers who are new to Kubernetes or containers.
3. AI/ML: AWS integrates with popular big data tools and offers a serverless computing option, but Amazon's core competency is retail. Google's core competency is artificial intelligence and machine learning. You can see how many technical papers Google has published, compared with AWS.
4. Cybersecurity: By default, GCP encrypts all data in transit between Google, its customers, and its data centers, as well as all data in GCP services and stored on persistent disks. In AWS, data encryption is available, but not by default.
5. Cost: Budget-friendly pricing is one of GCP’s main selling points, with its Cloud Platform Committed Use and Sustained Use Discounts offering significant cost savings over AWS, with no upfront costs.