[GCP ML engineer certification Day2]

Updated: December 17, 2020

Recommending Products using Cloud SQL and Spark

Google Cloud Products

Cloud SQL - managed relational database service (MySQL and PostgreSQL)
- Familiar
- Flexible pricing
- Managed backups
- Automatic replication
- Connect from anywhere
- Fast connection
- Google security
Cloud Dataproc - managed environment on which you can run Apache Spark

Why migrate an on-premises application to Google Cloud Platform?

To avoid the challenges that come with using and tuning on-premises clusters.
When you move an on-premises application to Google Cloud, you also move from dedicated on-cluster storage to off-cluster storage in Google Cloud Storage.

The core pieces of a recommendation system

data
model
infrastructure to train and serve recommendations to users

A core tenet of machine learning is to let the model learn for itself what the relationship is.

Hand-coded rules are hard to maintain.
Why can’t we train a machine learning model to rank these links instead?
That’s exactly what Google did internally with a deep learning model called RankBrain.

Machine Learning = examples, not rules

Concept of Recommendation System

Ingest the ratings that our users have already given to the houses we showed them.
- explicit ratings : we showed the house to the user in the past and they clicked four stars after seeing the house details.
- implicit ratings : the user spent a lot of time looking at the web page for this property.
Train the model on those ratings.
For each user, pick the five houses with the highest predicted rating that they haven’t already seen.
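
The last step can be sketched in a few lines of Python. This is only an illustration, not the course's code: the `predicted` dictionary, the `seen` set and the function name `top_unseen` are made-up names, and the predicted ratings are assumed to come from a model that has already been trained.

```python
import heapq

def top_unseen(predicted, seen, n=5):
    """Return the n house ids with the highest predicted rating that the user has not seen."""
    candidates = (
        (rating, house) for house, rating in predicted.items() if house not in seen
    )
    # nlargest compares (rating, house) tuples, so ties fall back to the house id
    return [house for rating, house in heapq.nlargest(n, candidates)]

# Hypothetical predicted ratings for one user
predicted = {"h1": 4.6, "h2": 3.1, "h3": 4.9, "h4": 2.0, "h5": 4.2, "h6": 3.8, "h7": 4.4}
seen = {"h3", "h6"}  # houses this user has already been shown

print(top_unseen(predicted, seen))  # ['h1', 'h7', 'h5', 'h2', 'h4']
```

Note that h3 has the highest predicted rating but is skipped because the user has already seen it.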

How the recommendation models work

The model is based on two things.

your own ratings: how have you rated other houses?
other people’s ratings of this particular house.

Combining other users’ ratings of a particular house with the ratings of users like you is the basic premise of how recommendation models work.
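
A minimal sketch of that premise, assuming a tiny hand-made `ratings` table and cosine similarity as the measure of "users like you". Both are assumptions for illustration; the course does not prescribe a particular similarity measure, and real systems usually use something more robust.

```python
from math import sqrt

# Hypothetical explicit ratings: user -> {house: stars}
ratings = {
    "ann": {"h1": 5, "h2": 3, "h3": 4},
    "bob": {"h1": 5, "h2": 3, "h4": 5},
    "cat": {"h1": 1, "h2": 5, "h4": 2},
}

def similarity(a, b):
    """Cosine similarity over the houses both users rated (0.0 if none in common)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[h] * b[h] for h in common)
    norm_a = sqrt(sum(a[h] ** 2 for h in common))
    norm_b = sqrt(sum(b[h] ** 2 for h in common))
    return dot / (norm_a * norm_b)

def predict(user, house):
    """Similarity-weighted average of other users' ratings for this house."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or house not in theirs:
            continue
        weight = similarity(ratings[user], theirs)
        num += weight * theirs[house]
        den += weight
    return num / den if den else None  # None when nobody comparable rated it

print(predict("ann", "h4"))
```

Here ann has rated h1 and h2 exactly like bob and quite differently from cat, so her predicted rating for h4 lands between cat's 2 and bob's 5, pulled toward bob's.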

Things that you have to consider

how to find the users who are most like you
how many users to consider
how to weight the different factors such as the overall popularity of the items you have in common

These choices can be tuned by intentionally withholding some ratings and seeing which parameter values predict the withheld ratings best.
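
As a rough illustration of tuning against withheld ratings, the sketch below picks a single blending weight between "this user's average rating" and "this house's average rating". The data, the simple blend and the candidate weights are all assumptions chosen to keep the example small; a real model has many more parameters.

```python
from statistics import mean

# Hypothetical (user, house, rating) triples
train = [
    ("ann", "h1", 5), ("ann", "h2", 4),
    ("bob", "h1", 4), ("bob", "h3", 2),
    ("cat", "h2", 3), ("cat", "h3", 1),
]
# Ratings we intentionally hold back from training
withheld = [("ann", "h3", 3), ("bob", "h2", 3), ("cat", "h1", 3)]

def means_by(index):
    """Average training rating grouped by user (index 0) or house (index 1)."""
    groups = {}
    for row in train:
        groups.setdefault(row[index], []).append(row[2])
    return {key: mean(values) for key, values in groups.items()}

user_mean = means_by(0)
house_mean = means_by(1)

def rmse(weight):
    """Root mean squared error on the withheld ratings for a given blend weight."""
    errors = [
        (weight * user_mean[u] + (1 - weight) * house_mean[h] - r) ** 2
        for u, h, r in withheld
    ]
    return mean(errors) ** 0.5

candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best = min(candidates, key=rmse)  # the weight that predicts the withheld ratings best
print(best)  # 0.5 for this toy data
```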

Things that the machine learning model considers

Who is this user like?
Is this a house that people tend to rate highly?

The predicted rating is a combination of both these factors.

Last problem = Infrastructure

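How often and where will you compute the predicted ratings? Batch or streaming?

Batch : we don’t need to update the rental recommendations every time a new rating appears in our system, so a periodic job (for example, once a day) is enough.
Streaming : would recompute predictions as each new rating arrives, which this use case doesn’t need.
Where : we have to compute the rating that every user would give to every house -> do it on a big data platform like Apache Hadoop.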

Finally, where will you store the computed ratings?

We probably want to power a web application with these recommendations, and we don’t want to compute recommendations at the moment the user loads a web page.
Instead, we want to precompute these recommendations with batch jobs.

So we need a transactional way to store the predictions.

While the user is reading these predictions, we can update the predictions table as well.
So just store the data in a relational database management system (RDBMS) such as MySQL.
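
The sketch below shows what "transactional" buys us here: a batch job replaces one user's recommendations inside a single transaction. It uses Python's built-in `sqlite3` purely as a local stand-in for MySQL, and the table name `recommendation`, its columns and both helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a connection to a MySQL database
conn.execute(
    "CREATE TABLE recommendation ("
    " user_id TEXT, house_id TEXT, predicted REAL,"
    " PRIMARY KEY (user_id, house_id))"
)

def replace_recommendations(conn, user_id, rows):
    """Delete and re-insert one user's recommendations as a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM recommendation WHERE user_id = ?", (user_id,))
        conn.executemany(
            "INSERT INTO recommendation (user_id, house_id, predicted) VALUES (?, ?, ?)",
            [(user_id, house, score) for house, score in rows],
        )

def read_recommendations(conn, user_id):
    """What the web application would run when rendering a page."""
    cur = conn.execute(
        "SELECT house_id FROM recommendation WHERE user_id = ? ORDER BY predicted DESC",
        (user_id,),
    )
    return [row[0] for row in cur]

replace_recommendations(conn, "ann", [("h1", 4.6), ("h7", 4.4)])  # first batch run
replace_recommendations(conn, "ann", [("h5", 4.8), ("h1", 4.1)])  # later batch run
print(read_recommendations(conn, "ann"))  # ['h5', 'h1']
```

Because the delete and the inserts commit together, a failed batch run leaves the previous recommendations in place rather than an empty or half-written set.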

How to migrate the recommendation system from on-premises to Google Cloud Platform

Use SparkML, but instead of doing it on-premises, we’ll run the machine learning job on Cloud Dataproc.
Then store the ratings in an RDBMS in Cloud SQL, because this is a relatively small dataset of five recommendations for every user.

Use Cloud Storage as a global file system.
Use Cloud SQL as an RDBMS (relational database management system) for transactional relational data that you access through SQL.
Use Datastore as a transactional NoSQL object-oriented database.
Use Bigtable for high-throughput NoSQL append-only data. A typical use case for Bigtable is sensor data from connected devices.
Use BigQuery as a SQL data warehouse to power all your analytics needs.

If your data is unstructured, like images or audio, use Cloud Storage.
If your data is structured and you need transactions, use Cloud SQL or Cloud Datastore, depending on whether your access pattern is SQL or NoSQL. By NoSQL we mean key-value pairs.
In other words, if you’ll be looking up data by a single key, use Datastore.
If you’ll be finding data using SQL, use Cloud SQL.
Cloud SQL scales vertically on a single instance, so it generally plateaus; the course puts this at a few gigabytes, although Cloud SQL’s documented storage limits are in the terabyte range.
If you want a transactional database that is horizontally scalable, either because you have more data than a single Cloud SQL instance can handle or because your application needs to be transactional across different continents, use Cloud Spanner.
If your data is structured and you want Analytics, consider either Bigtable or BigQuery.
Use Bigtable if you need real-time high-throughput applications.
Use BigQuery if you want Analytics over petabyte-scale datasets.
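
The decision rules above can be written down as a small function. This is only a memory aid for the notes, not an official API: the function name `choose_storage` and its parameters are made up, and real choices involve more factors (cost, latency, existing tooling).

```python
def choose_storage(structured, workload="transactional", access="sql",
                   horizontal_scale=False, real_time=False):
    """Map the rules of thumb from these notes to a Google Cloud storage product.

    structured:       False for images, audio and other unstructured data
    workload:         "transactional" or "analytics"
    access:           "sql" or "key" (NoSQL lookup by a single key)
    horizontal_scale: data or geography beyond what a single Cloud SQL instance handles
    real_time:        real-time, high-throughput access is required
    """
    if not structured:
        return "Cloud Storage"
    if workload == "transactional":
        if access == "key":
            return "Cloud Datastore"
        return "Cloud Spanner" if horizontal_scale else "Cloud SQL"
    if workload == "analytics":
        return "Cloud Bigtable" if real_time else "BigQuery"
    raise ValueError(f"unknown workload: {workload!r}")

# The recommendation table from these notes: small, relational, read by a web app
print(choose_storage(structured=True, workload="transactional", access="sql"))  # Cloud SQL
```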

Jeongho Shin (Leo)