cloud storage

2c per gb per month 99.5% availability (5 out 1000 failures possibly) 11x9’s durable

object store (replace only), not a file store (can’t open the file for writing)

bucket names are globally unique

can name a bucket like a domain name - e.g. files.[your name].com if you can prove you own your domain - create a CNAME record to point your subdomain to the storage bucket

max size of a single object is 5 TB

Uses: - all static web content - e.g. stylesheets, images, videos, javascript files,

region - a data centre in the world - two or more zones; most have 3 zones; some has 4 zone - independant power supply etc.

Region 99.9% avaailability, 2.3 cents / GB-month multi-region 99.95% us or asia or europe, 2.6 cents / GB Dual-region - 99.95% iowa (middle of us) and south carolina (east)

Storage classes - Standard 2.3c/GB-month storage cost; no min storage duration - Nearline 1.6c/GB-month + 10c/GB-month retrieval cost - cheaper if accessed less than once per month; min duration 30 days - Coldline 0.6c/GB-month + 5c/GB-month retrieval - if accessed less than once per year; min duration 90 dayss

https://cloud.google.com/storage/docs/storage-classes Understand how you are being charged - not remember the specific prices Min storage duration is charged, even if you delete it immediately - for nearline and coldline!

Access - Set permissions at bucket level (bucket policy only) - Set permissions at both object-level and bucket-level

Encryption - all deta is encrypted by default in all google services - can’t turn it off - google managed key - customer managed key via Google Cloud Key Management Service (KMS) - customer supplied keys - created and managed by you, and uploaded

Retention policy - min time before objects can be deleted or changed after uploaded

FEATURES - object metadata - encryption - versioning - gsutil ls -a # show all files, including all old versions of the file - lifecycle management - lifecycle rules: based on age, creation date, storage class, blah then do an action, like set it to nearline, coldline, delete it - change notifications - use Cloud Functions to do something with the object when it is uploladed etc - security - bucket permissions - add members - email account, allusers, - group, domain, user email, project - read or owner permissions

IAM Roles
- storage object viewer - 4 permissions - get & list for resourcemanager.projects & storage.objects
- creator - 3 permissions - resorucemanageerget/list
- admin - 9 permissions

Signed URLs - ON THE EXAM

- temp access to buckets

gcloud iam service-accounts keys create ~/key.json --iam-account [email protected]
gsutil signurl -d 10m ~/key.json gs://super-secure-bucket/noir.jpg

Static content - Shared files - allUsers read access - https://storage.googleapis.com/bucketname/filename - map bucket to your domain - prove your ownership with a TXT record, and add CNAME for you subdomain to c.storage.googleapis.com - option on the bucket - “edit website configuration” to specify Main page and 404 page - http not https - create a load balancer (under network services) backend config: pointing to your storage bucket, enable Cloud CDN optioanlly, frontend config: http or https protocol, provide a certificate (have google create for free (let’s encrypt) or upload own) - create A record to point to load balancer

CDN is ON EXAM - purpose of worldwide DO EXERCISE TO DEPLOY WEBSITE TO STORAGE https://labs.roitraining.com/labs/758-03-gcp-cloud-storage/index.html

gcloud init

gsutil ls               # shows buckets
gsutil ls gs://mybucket

# Upload to bucket

gsutil cp localfile gs://mybucket   # copy

gsutil rsync -r -x ".git/*" . gs://BUCKET_NAME/
    # -r recursive copy
    # -x exclude files in the .git folder


# Set bucket to public-read

gsutil -m acl set -R -a public-read gs://BUCKET_NAME
    # -m multithreaded
    # -R recursive
    # -a all object versions

# Version control

## Get status of versions ("Suspended" = not enabled, "Enabled")
gsutil versioning get gs://BUCKET_NAME/

gsutil versioning set on gs://BUCKET_NAME/

## List versions of the file
gsutil ls -al gs://BUCKET_NAME/myfile

$ gsutil ls -la  gs://adam1/test.txt
         3  2019-10-15T14:15:24Z  gs://adam1/test.txt#1571148924339107  metageneration=1
         6  2019-10-15T14:17:25Z  gs://adam1/test.txt#1571149045157341  metageneration=1
TOTAL: 2 objects, 9 bytes (9 B)

Cloud Transfer Services

web console gsutil will upload multiple files in parallel, resumable in case of network interruption, json api

bigquery data transfer service

Storage Transfer Service

bucket to bucket scheduled - Createa transfer job into a jucket from storage bucket, a3 bucket, …,

Transfer Appliance

offline data transfers
request transfer applicance
you manage encryption key
100TB, 480TB

Cloud CDN

Charged for netwrok egress

costs of delivering content from a bucket vs. the CDN https://cloud.google.com/storage/pricing#network-pricing https://cloud.google.com/cdn/pricing

Persistent disk

Prefer Cloud Storage, unless you need an actual file system!
must be attached to a VM
compute engine - create a VM
ssh install apache http server cd /var/www/html
under disks - has a “standard persistent disk”, in same zone as the VM
snapshot a disk
- on a schedule
- stored in cloud storage - regional or multi-regional
- create disk based on snapshot
- uses:
  - backup a persistent disk
  - move a disk to another region or zone
  - make a persistent disk larger
min 10GB to 64TB?
can add/create more disks - SSD or standard magenetic
- size of disk changes the IOPs

DO EXERCISE https://labs.roitraining.com/labs/793a-gcp03-disks-and-snapshots/index.html#0

################# ################# ADAM HERE: ################## #################

Encryption

files broken into chunks
chunks encrypted with Data Encryption Key (DEK)
DEK is encrypted with Key Encryption Key (KEK)
Chunks stored multiple times on physical disks

Key management - gogole managed - no config required - KEK rotated frequently - DEK rotation less often - data re-encryption required at least once every 5 years - customer managed - manage via Google Cloud KMS - customer supplied - manage outside of Google. - don’t lose the key! - Better to encrypt file before uploading it, rather than using a customer supplied key!

Cloud Key Management Service (KMS)

https://cloud.google.com/kms/docs/quickstart - create a keyring and add keys to it - symmetric, asymmetric - rotation period, auto generated or own https://cloud.google.com/kms/docs/algorithms#key_purposes

apache beam

b for batch
eam for stream

bigtable

$1400 per month just for the instance in us-central-1 noSQL fully managed not transactional like cassandra / hbase wide-column store

number of nodes dictates network capacity

Config: - storage type : SSD, HDD (slower, cheaper) - cluster id - region & zone - choose number of nodes - min of 3 for prod - add a replicated cluster in a different region/zone

Performance - Based on current node count and storage type. - Adding a cluster increases read throughput but not write throughput. When writing, Cloud Bigtable uses the additional throughput for replication.

Price - 1 cluster of 3 nodes AUD $1922/month - plus 1TB HDD $35/month

Schamas - tables - row key - columns can be grouped in column families - store all data about an entity in a single row - updates to a single row are implicitly transactional - transactions are only within one row - multiple rows not supported - No indexes (unless you create your own table to be that index, or create a better row key…)

EXAM - question on how to design row keys - counter is a bad row key - all new rows go to the last tablet - use a hash? - timestamp is a bad row key - last tablet - GUID is good - random and distributed - but not helpful for a search based on an attribute of the data - Design based on the likely query predicate performed - good example: highway number - mile marker / sensor number - timestamp

https://cloud.google.com/bigtable/docs/schema-design#row-keys

Metadata - Table is assigned multiple tablets to store metadata - tablets are distributed over different disks

Use BigTable - when You need a place to store a huge amount of data (100TB plus) and you need extremely fast random access to individual records.

Replication guidance

Replication for Cloud Bigtable copies your data across multiple regions, enabling you to isolate workloads and increase the availability and durability of your data. Learn more https://cloud.google.com/bigtable/help/replication/overview

Replication cannot be paused. All writes are automatically replicated to all clusters.
Replication uses a primary-primary configuration with eventual consistency.
Replicating writes between clusters requires additional CPU for each cluster in the instance.
Provision enough nodes for each cluster to keep the cluster's average CPU utilization under the recommended threshold for your current number of clusters and app profile policy. For instances with 1 cluster or multiple clusters and single-cluster routing, keep average CPU utilization under 70%. For instances with 2 clusters and multi-cluster routing, keep CPU utilization under 35%. For instances with more than 2 clusters and multi-cluster routing, see additional configuration guidance.
Routing and other client access patterns are configured by the application profile. If needed, you can create and customize additional profiles.

firestore / datastore

Datastore has been migrated to Firestore no sql ACID transactional

indexes created by default on every property - so there are “secondary indexes” and more!

Firestore was purchased, Google liked it better than Datastore, but wanted to consolidate them

All datastore was migrated to Firestore in Datastore mode.

Runs on top of a shared instance of Spanner!! Same availability!

Config: - Mode - Native mode (use for all new development) - all features, offline support and real-time sync. API is like Mongo db. - Datastore mode - - location - multi-region - eur, nam 99.999% (same as spanner) ~17c / GB-month + reads/write (1GB free) - single region - syd 99.99% (same as spanner) ~8c / GB-month + reads/write (1GB free)

1GB storage + 50k entity reads + 50k small ops : DAILY FREE tier!! https://cloud.google.com/datastore/pricing

so if small website data, then Firestore is the cheapest option!!

relational - datastore terminolgy - tables - kinds - rows/records - entities - fields - properties - primary key - key - relationship - entity group (for faster reads on related data) - primary-foreign keys - ancestor paths (composite key points back to parent)

Firestore - uses Firebase API - like Mongo - not supported with older App Engine rntimes - supports offline mobile devices and synch

firestore - dtaastore terminology - collections - kinds - documents - entities - key-value pairs - properties - document name - key - sub-documents - entity groups

https://cloud.google.com/firestore/docs/quickstart-servers

DO THE FIRESTORE EXERCISE EXAM - https://quiz.roitraining.com/de-quiz4.html

Memorystore

managed, in-memory redis caching (key-value) - SET/GET no sql scale 1GB to 300GB instances

Config:

basic tier - single instnace - $35 / month

standard tier - has failover replica in another zone - 99.9% availability - $46 / month

choose RAM 1GB - 300GB

open-source

tutorial: try.redis.io (not needed for exam)

secure with an internal IP address only - available only from within same VPC - IAM roles

Storage Relative Pricing

EXAM! Using the Price Calculator, estimate the cost of running a 10TB database using Cloud SQL, Spanner, Datastore, and Bigtable https://cloud.google.com/products/calculator/ Do your best to estimate the values for each product Use the documentation to help with your estimates Compare with https://cloud.google.com/products/calculator/#id=a0518a39-ae29-4bd6-868b-162dc3e6fe64

In order or expense:

Spanner $1600 / month 3 nodes Bigtable $1400 / month for cluster of 3 nodes MemoryStore $40/month Cloud SQL min $9/month 1 machine instance Firestore/Datastore 1GB storage free tier Cloud storage 5 GB US-regional free /month BigQuery - 10GB free storage, then 2c/GB-month for 1st 3 months, then 1c. 10TB left for a year = $1500

https://cloud.google.com/free

Cloud SQL

Uses: - dynamic web apps - load balancer -> compute engine instances in different zones -> Master and replica Cloud SQL in different zones

Managed service for relational dbs in GCP

Managed compute instances - you choose VM machine type and disk

regional only ***

Config options

Database type - MySQL, PostgresSQL
instance id
Root password
Region / Zone - keep it close to your sevices
Database version

Instance of: - MySQL - PostgresSQL - SQL Server is in beta/preview - and will be more expensive due to license

Default user password is required

= Connectivity Public IP by default - but there is a firewall preventing internet access - you need to autorise a network / open the firewall to that IP address range Private IP - only available within GCP network

= Machine type & Storage - choose a machine with enough memory to hold your largest table - shared-core machines, standard machines, high memory machines - storage type - SSD, HDD - storage cpacity 10GB to 30GB - Start small and select “Enable automatic storage increase”.

Tune cores 1 shared to 64 CPUS - controls the network throughput MB/s. 10 CPUs gives max network of 2000 MB/s
Tune memory 3.75 to 6.5 GB

SSD or HDD Capacity between min 10GB to 30TB (to 30720 GB) - can’t be decreased. Start small and select “Enable automatic storage increase”. Increased capacity => more disk throughput Mb/s

Automated backups & HA - window

Availabiltiy - single zone (1 db server) - high avail (regional) - auto failover to another zone in the same region

Maintenance schedule - maint window - any window (fine for HA), otherwise select time and day of week - maint timing - any

Know the Advanced Options FOR THE EXAM

Standard installations of MySQL and PostgresSQL - no special config performed

https://cloud.google.com/sql/docs/

https://cloud.google.com/sql/docs/mysql/quickstart

https://cloud.google.com/sql/pricing#2nd-gen-pricing

spanner

https://cloud.google.com/spanner/docs/ https://cloud.google.com/spanner/docs/best-practice-list

Fully managed relational, strongly consistent, ACID transactions Scales globally, distributed horizonatally across regions Number of nodes determines capacity Designed for extremely large relational databases

expensive! Min of US $1600 / month plus storage and usage costs!

Single region Sydney: $1.22 per node-hour * 3 nodes for prod + $0.405 per GB-Month + general network egress (e.g. 19c / GB)

us-central1: $0.90 per node-hour + $0.30 per GB-Month

Multi-regional nam-eur-asia1: $9 per node-hour + $0.90 / GB-month

nam3 (north viginia / South Carolin) $3 per node-hour + $0.50 / GB-Month

Regional 99.99% availability

Multi-regional 99.999% availability

Config: - name - regional / multi-regional - each region has different number of replicas and availablitly - diff sets of read-write replicas - nam = north america - eur = europe - asia

USD $8k per month
USD $42k per month for multi-regional - nam-eur-asia1
number of nodes
- 1 node gives you no availability SLA - good for dev!
- min 3 nodes for a prod env
- recommended 1 node per 2TB of data
- 10TB = 5 nodes
- 100TB = 50 nodes
- 1000TB (1 petabyte) = 500 nodes

For optimal performance in this configuration, we recommend provisioning enough nodes to keep high priority CPU utilization under 45%.

uses atomic clocks attached to the servers to manage transactions
ansi sql 2011

Hadoop

17c per gb per month on the cluster

Dataproc

Hadoop cluster

Config - name - region & zone - cluster mode - Single node 1 master, 0 wokers - Standard - 1 master, N workers (typical for production) - High Availabiltiy - 3 masters, N workers - Master node - VM settings: compute size - n1-standard-?, HDD size - default is 500GB, min 15GB - Worker Nodes - VM settings: compute size - n1-standard-?, HDD size - default is 500GB, min 15GB - Number of nodes - Advanced - Component Gatway - Enable accces to the web interfaces of default and select optional components of the cluster - Gives access to Web interfaces (tab in cluster details) for YARN Resource Manager, HDFS Name Node, MapReduce job history

Preemptible worker nodes
- temp; 80% cheaper; may not get it; taken after after 24 hours

Hive & SparkSQL & Presto all allow you to submit SQL to the master - Hive to Hadoop - SparkSQL to Spark - Dataproc is both Hadoop and Spark!

Google would say don’t use Presto.

EXAM - reduce costs with preemptble - store data in cloud storage - use Dataproc to easily migrate legacy hadoop jobs…

Dataproc creates an HDFS cluster, but don’t use it for long-lived storage

Store the data to analyze in Google Cloud Storage Cloud Storage is cheaper (HDFS cluster created on Persistent Disks) Only pay for what you use, not for what you allocate

Separating storage and compute allows the cluster to be disposable Can size the cluster for specific jobs Delete the cluster as soon as possible Only pay for it while it is working Can recreate the cluster in a couple minutes

Can run existing Hadoop and Spark jobs on Dataproc Use Cloud Storage not HDFS so the Dataproc cluster can be deleted without deleting the data Make the attached disks small

Move HBase workloads to Bigtable to reduce administration

Script creation of Dataproc clusters and jobs Delete the cluster when the jobs are completed to reduce cost Machines are billed in 1-minute increments with a 10-minute minimum Consider using preemptible instances for some of the workers

** Choose BigQuery over Dataproc for greenfield **

Could run Jupyter on the Dataproc cluster

Google Big Data History

Google V1 - GFS distributed file system + MapReduce became HDFS and Hadoop and given to Apache

Google V2 - Colossus replaced GFS as Cloud Storage - Allows data stored off the cluster and brings more flexibilty to cluster management

- Dremel replaced MapReduce as BigQuery (similar to Hive, Spark SQL, Presto)
    - Master parses SQL statement and turns it into a job
    - SQL is well suited to MapReduce functionality

Pig is like Python compared with Java - verbosity Hive allows SQL to be submitted to Hadoop cluster - like BigQuery Apache Spark - effectively Hadoop v2, doing more in-memory to optimise Hadoop

bigquery

typically the answer for anything mentioning “Data warehousing”! move HBase workloads to BigQuery

Purpose - Data warehouse - Data analytics - ML

Uses: - data warehousing - get data in cloud storage first -> bigquery - cloud storage -> dataflow for ETL -> bigquery - log processing: - Cloud Logging Log Collection export to cloud storage once per day -> dataflow -> bigquery - Cloud Logging Log Collection stream to pub/sub -> dataflow -> bigquery - shopping cart analysis: - dataflow/cloud storage -> dataproc/dataflow -> bigquery or application (app engine/compute engine / container engine)

BigQuery has its own storage system Most efficient storage when running BigQuery queries

Projects contain datasets, which contain tables There is no limit to the number of tables in a dataset There is no limit to the number of rows in a table

Each field in a table is stored separately Makes querying more efficient because only fields in a query are read

All data is encrypted by default All data is compressed for faster processing

Tables must have a schema which defines field names and data types Schemas support nested, hierarchical data (Array) Schemas support repeated fields For example, an Orders table can have a field called Details which is an array of records which provides details about each order

Repeated, nested fields allow querying parent-child relationships without needing to join two tables Joins are expensive in BigQuery since there are no indexes

Storage Pricing - Storage 2c per GB-month for first 3 months, then 1c per GB-month - 10TB left for a year = $1500

ANSI SQL 2011 for SELECT queries

Can write user-defined functions to manipulate data in SQL or JavaScript

Can query from a number of data sources: - BigQuery Storage - Google Cloud Storage - Google Bigtable - Google Drive

Completely NoOps
- No need to provision anything - No need to tune queries

Query Pricing

Queries are charged one of two ways: on-demand and flat-rate - EXAM

On demand at $5/TB of data processed with 1TB free per month

1TB = $5 10TB = $50 1000TB (1 Petabyte) = $5000

With flat-rate pricing, you pre-purchase BigQuery capacity - Capacity is measured in “slots”, a unit of processing in BigQuery - Run as many queries as you can, but you will never go over your purchased slots

EXAM - multiple questions on correct BigQuery queries - which is the right one

There are two dialects of BigQuery SQL, Legacy and Standard Legacy was the original, replaced in 2016

Standard SQL is ANSI 2011 SQL compliant Very minor differences exist due to the platform

Standard SQL includes data manipulation language (DML) statements INSERT, UPDATE, DELETE Strict quotas apply to DML statements

STRUCT function creates a composite field. ARRAY_AGG function creates an array. UNNEST flattens the array of objects so they are queryable ARRAY_LENGTH(col) - count

Security - EXAM!

Access is granted to BigQuery using IAM Members and Roles

** Table access is granted at the Dataset level All Tables within a Dataset share the same permissions For public Datasets, grant Viewer role to allAuthenticatedUsers

Members at minimum need Job User role to run queries

BigQuery Roles - Admin - Data Editor - RW data - Data Owner - RW to data, plus grant access to other users and groups by setting IAM policies - Data Viewer - User - run jobs, create datasets, list tables, save queries. No default access to data - Job User - create and run jobs, but not access data - the least set of permissions possible

Performance

No indexes

Denormalize parent-child relationships Store child records as repeated records with the parent row Allows querying related data without a join

Queries that were already run are cached Data returned from the cache is free

For very large tables, create smaller temp tables when possible

Partition tables when data is accumulated on a regular basis For example, daily logs, daily sales, etc. Can specify the data range of partitions to query, avoiding a table scan

Don’t group by fields with a very large number of different values

Prefer built-in functions to UDFs if possible

Partition by ingestion time - every partition is stored in a separate file/table - psuedo column created - _partitiontime - WHERE _partitiontime = OR Between

External Data / “Federated” Data

BigQuery can analyze data directly from: - Bigtable - Cloud Storage - Google Drive

Must define a table schema for the external source

Not as efficient as BigQuery native storage Native storage puts fields in separate tables so queries only have to scan the fields in the query, not the whole table

Useful for ETL and denormalization jobs Read data from external source, manipulate it, and load it into BigQuery

See: https://cloud.google.com/bigquery/external-data-sources

Dataflow

Fully managed cluster of VM instances is created to start a job

Use Cases: Not for ETL on 1GB data - not enough data to make it worthwhile - takes longer to start the cluster than run the job! Meant for Terabytes of ETL! Meant for windowing/aggregation/complex event handling of streaming data

Perform a set of actions in a chain Data from one or more sources is read into the pipeline Actions are performed on the data to manipulate or transform it The manipulated results are sent as output from the pipeline

Actions within a data pipeline can run at the same time (concurrently)

At scale, multiple machines can participate to get the pipeline done faster

MapReduce jobs are examples of pipelines at scale Multiple nodes read data from disk and perform an initial map step Data is then organized by key (the shuffle step) Keyed data is processed separately in parallel (the reduce step)

Dataflow is the most optimised way of running Apache Beam on GCP

Batch data flows process big chunks of data at set intervals Analyzing daily logs Importing monthly sales Periodic data conversions

Streaming data flows process data as it is accumulated Analyzing traffic to determine the quickest route What tweets are trending right now What products are selling today

Dataflow pulls together a number of GCP services to run data flows It is the job of the Dataflow service to optimize execution

Cloud Storage is used as a staging area for data flow code Can also be used for data input and output

BigQuery tables can be used for input and output BigQuery is frequently the preferred tool to analyze data flow output

Cloud Compute instances are used to execute data flows The Dataflow service determines how many instances are required Google’s high-speed network is used to move data around

Pub/Sub is used to provide streaming data flows

NOTE: for streaming, there will be at least one VM that you are paying for!! Not the cheapest way to do streaming! (cheapest is PubSub -> Cloud Function or AppEngine)

BUT Dataflow has windowing mechanisms, aggregation, complex event handling capability - which is the reason why you would use it for streaming - anomaly detection - use ML to create notification if number of messages in a window is larger/smaller than usual

Dataflow has built-in support for three types of windows Shuffles arriving messages into the correct window based on event time Fixed time windows are the simplest Each window is given an interval Sliding time windows have an interval and a period The interval defines how long a window collects data for The period defines how often a new window starts Session windows define windows on areas of concentrated data Uses a key to combine data into groups Like sessions in a web application

Time-based windows can be based on with the event time or process time.

PubSub

Pub/Sub is a fully managed, massively scalable messaging service It allows messages to be sent between independent applications
Can scale to millions of messages per second

Pub/Sub messages can be sent and received via HTTP(S)

Pub/Sub supports multiple senders and receivers simultaneously

Pub/Sub is a global service Messages are copied to multiple zones for greater fault tolerance Uses dedicated resources in every region for fast delivery worldwide

Pub/Sub is secure All message are encrypted at rest and in transit

Topic names are in the form: projects//topics/

Config: - Delivery type - Pull - e.g. Dataflow jobs, a Kubernetes service; subscriber calls pull() then acknowledge() - Push - HTTPS web service / Cloud Function, App Engine app + std environment (scales to zero); ACK implied by a 200 OK response code

Pricing: - charged by GB of message data

Event time is when something actually occurred The time an order is placed for example This is when the message is published to Pub/Sub Process time is when the system observes the event When the subscriber receives the Pub/Sub message Obviously, process time is always after event time Usually, this is a short period of time Sometimes, a system problem will delay the process time The difference between event and process times can vary significantly Event occurs in a mobile application when the user is on an airplane

Dataprep

Visual tool for cleaning and manipulating data

For massive datasets only (it runs on Dataflow - think Terabytes) Serverless on top of Dataflow

Composer

A service that creates the Apache AirFlow server for you Orchestrate workloads across GCP, on-premises & other clouds

Create environment - name - node count - machine type - service account

Cloud Storage bucket
- workflows uploaded here are automatically scheduled in AirFlow by composer

Navigation -> Big Data -> Composer

Apache Airflow - Workflow orchestration engine - python

AirFlow UI - history of jobs

Workflow - coded in Python - start date - job valid from - each step is in a DAG - each step is an Python or Bash Operator - specify order of steps in python: step1 >> step2 >> step3

Built-in connectors for many GCP services

If bug, just overwrite the file in storage

App Engine

Standard environment (scales to zero)

Cloud Function

Great for cheap push of a steaming data from PubSub

40c per million request 2 million requets free per month + CPU usage

Data Loss Prevention

API to help classify and redact sensitive data Helps customers meet their compliance obligations Works with image or text data

Data can be in GCP, other clouds, or on-prem Built-in support for BigQuery, Datastore, and GCS

90-plus built-in detectors for common sensitive data items E.g., credit card numbers

Detectors can be customized to classify/redact new data items E.g., social security numbers in a particular country’s format

Datalab

Managed Jupyter service

Interactive tool for data analysis, machine learning, and many other tasks

Based on Jupyter, an open-source project for creating iPython Notebooks

Supports many languages: Python, JavaScript, Shell scripts, HTML, SQL, etc.

Integrated with GCP, so access to other services like BigQuery are simple

Integrates with Git to enable collaboration and sharing notebooks

Runs in a Compute Engine Virtual Machine No extra charge for Datalab beyond the machine cost

Manage Datalab instances using the Google Cloud SDK in Cloud Shell To create a Datalab instance:datalab create instance-name To stop an instance without deleting it:datalab stop instance-name To restart a stopped instance:datalab connect instance-name To delete an instance:datalab delete instance-name

Can pass in options to choose the machine type with the create command

For more options and details on using the SDK, see https://cloud.google.com/datalab/docs/how-to/lifecycle

Creates a VM Stop the VM and no compute cost, but still storage cost

Advantage is that - Google Cloud SDK is already installed - Already authenticated

%%bq query

SELECT ...

%javascript

function blah() { }

Data Studio

datastudio.google.com

SaaS

Business reports / BI / charts & graphing

Free

Machine Learning

In EXAM - ML basics - basic terminolgoy - basic model - GCP tools to help you

Target / Label Features / inputs Example - a row of data of inputs and label/target

Feature Tuning 1. gather data 2. Split data into train/test 3. Train / draw line

RMSE - root of the mean of the summed squared error MAE - mean absolute error - using absolute value instead of squaring

Fit evaluate predict

Gradient Descnet - find the best fit line / function / algorithm - if RMSE goes down, move in same direction - if RMSE goes up, move in other direction

Different features are more important than others - hence each has a weighting Sum of weighted features gives the prediction

Algorithms - Linear Regression - value along a line - Classification - 2 or more categories - Single layer neural models - Neural networks - collection of layers

TensorFlow

No code in the EXAM

Supports Python, Go, JavaScript

Cloud AI Platform / ML Engine

Navigation -> AI Platofrm -> Jobs

Cloud-based service used to build machine learning models Used by Google to build their own models Fully-managed NoOps service Supports training on CPUs, GPUs, and TPUs Can be used to deploy models as services Command-line API for submitting jobs, deploying models, and making predictions

gcloud ml-engine jobs submit training my_job –module-name trainer.task –staging-bucket gs://my-bucket –package-path /my/code/path/trainer –packages additional-dep1.tar.gz,dep2.whl gcloud ml-engine jobs submit prediction JOB –data-format=DATA_FORMAT –input-paths=INPUT_PATH,[INPUT_PATH,…] –output-path=OUTPUT_PATH –region=REGION gcloud ml-engine models create MODEL [–enable-logging] [–regions=REGION,[REGION,…]] [GCLOUD_WIDE_FLAG …]

Allows you to specify the number and type of training machines no-ops and can automate hyperparameter tuning deploy multiple versions of same model concurrently

Pre-built ML Models

Vision API - face - objects - safe search detection - adult, racy, spoof, violence

create a Cloud Function that runs when an image is uploaded to Cloud Storage vision_client = vision.ImageAnnotatorClient() result = vision_client.safe_search_detection(blob) detected = result.safe_search_annotation if detected.adult == 5 or detected.violence == 5:

NLP - entities, syntax, categories

Sentiment
- score -1 to 1 negative or positive
- magnitude -

Translate API

Video Intelligence API - indexing

Speech API - recognition

AutoML

AutoML Vision, - Image classification - Object detection

NLP

Translation

Uses Transfer Learning - extend a pre-trained model - allows you to require less data

Provides a simple user interface for: Creating datasets Training and evaluating modes Using your trained models to make predictions

Trains model and gives a REST API to call the trained model.

BigQuery ML

Linear regression Binary logistic regression Multiclass logistic regression for classification

Must SELECT / mark one field as the label

Architecture

global load balancer -> CDN + storage bucket single region (website use case)

spanner multi-regional access

NoSQL

horiz scalable often schemaless varying support for SQL reporting can b harder

key-value stores - Redis - SimpleDB

Document stores - Data is stored in some standard format like XML or JSON, BSON - Nested and hierarchical data can be stored together - i.e. a single read instead of reading from multiple tables - MongoDB, CouchDB, and DynamoDB are examples - Firestore

Wide-column stores - Key identifies a row in a table - Column data types and number of columns can be different within each row - All info about an entity is within the same row - Cassandra and HBase are examples - Bigtable

NoSQL is NOT blob storage!

Availability / Durability

Availability is the percentage of time the data can be accessed - Achieved by deploying services to multiple zones and/or regions Durability defines the likelihood of losing data because of a hardware failure - Achieved by writing data to multiple physical disks; The more disks, the higher the durability

Use Cases

High availabilty - customer visible, real-time Cost and durability - backups, archives - sacrifice availability or performance to make it cheaper Cost is only consideration - cache or temporary data store that can be destroyed and recreated from other sources - typically in memory - want it to be cheap

Consistency

Transactional consistency – When a transaction completes, all operations must be successful or all are rolled back The data must conform to all rules specified by the database The state of data is known by all nodes in a distributed system e.g. financial payment, price of stock

Eventual consistency – After data is updated, the system guarantees that all copies of the data in a distributed system will “eventually” be the same It is possible that requests to different nodes will return different results after an update e.g. product reviews, social networking status updates

Strong consistency – All nodes in a distributed system that have received the same update will be in the same state once the transaction completes - all nodes would thus return the same result to a query

3 V’s of Big Data

Volume – there is a lot of data Need many drives to store the data

Velocity – if there is a lot, then you must be collecting it at a fast rate Need to be able to write to drives very quickly Need to be able to get the data back very quickly

Variety – the data is coming for many different sources Web pages Text files Logs PDFs Databases Etc.

ROI

Practice exam 20 questions: https://quiz.roitraining.com/de-practice-exam.html