~2c per GB per month; 99.5% availability (possibly 5 failures per 1000 requests); 11 nines (99.999999999%) durability
object store (replace only), not a file store (can’t open the file for writing)
bucket names are globally unique
can name a bucket like a domain name - e.g. files.[your name].com if you can prove you own your domain - create a CNAME record to point your subdomain to the storage bucket
max size of a single object is 5 TB
Uses: all static web content - e.g. stylesheets, images, videos, JavaScript files
region - a geographic location - contains two or more zones; most have 3 zones, some have 4 - each zone has an independent power supply etc.
Region - 99.9% availability, 2.3 cents / GB-month
Multi-region (US, Asia, or Europe) - 99.95% availability, 2.6 cents / GB-month
Dual-region (e.g. Iowa (middle of US) and South Carolina (east)) - 99.95% availability
Storage classes
- Standard - 2.3c/GB-month storage cost; no minimum storage duration
- Nearline - 1.6c/GB-month storage + 1c/GB retrieval cost - cheaper if accessed less than once per month; minimum duration 30 days
- Coldline - 0.6c/GB-month storage + 5c/GB retrieval cost - if accessed less than once per year; minimum duration 90 days
https://cloud.google.com/storage/docs/storage-classes Understand how you are charged rather than memorising the specific prices. The minimum storage duration is charged even if you delete the object immediately - applies to Nearline and Coldline!
Access - two options:
- Set permissions at the bucket level only (uniform / bucket policy only)
- Set permissions at both object level and bucket level (fine-grained)
Encryption - all data is encrypted by default in all Google services - can't turn it off
- Google-managed keys
- Customer-managed keys via Google Cloud Key Management Service (KMS)
- Customer-supplied keys - created and managed by you, and uploaded
Retention policy - minimum time before objects can be deleted or changed after upload
FEATURES
- object metadata
- encryption
- versioning - gsutil ls -a # show all files, including all old versions of the file
- lifecycle management - lifecycle rules: based on age, creation date, storage class, etc., then perform an action, e.g. move to Nearline/Coldline or delete (see the lifecycle sketch after the signed-URL commands below)
- change notifications - use Cloud Functions to do something with the object when it is uploaded etc.
- security - bucket permissions - add members (email account, allUsers, group, domain, user email, project) - read or owner permissions
- temporary access to buckets via signed URLs:
gcloud iam service-accounts keys create ~/key.json --iam-account [email protected]
gsutil signurl -d 10m ~/key.json gs://super-secure-bucket/noir.jpg
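Referring to lifecycle management above, a minimal sketch of setting a rule from the command line (the bucket name and ages here are made-up examples):
# lifecycle.json: move objects older than 30 days to Nearline, delete them after 365 days
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"}, "condition": {"age": 30}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://BUCKET_NAME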
Static content
- Shared files - allUsers read access - https://storage.googleapis.com/bucketname/filename
- Map bucket to your domain - prove your ownership with a TXT record, and add a CNAME for your subdomain to c.storage.googleapis.com
- Option on the bucket - "edit website configuration" - to specify Main page and 404 page - HTTP, not HTTPS
- For HTTPS, create a load balancer (under Network Services) - backend config: pointing to your storage bucket, optionally enable Cloud CDN - frontend config: HTTP or HTTPS protocol, provide a certificate (have Google create one for free (Let's Encrypt) or upload your own) - create an A record to point to the load balancer
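The main page and 404 page can also be set from the command line - a small sketch, assuming index.html and 404.html already exist in the bucket:
gsutil web set -m index.html -e 404.html gs://BUCKET_NAME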
CDN is ON THE EXAM - know its purpose: worldwide edge caching of content. DO THE EXERCISE TO DEPLOY A WEBSITE TO STORAGE https://labs.roitraining.com/labs/758-03-gcp-cloud-storage/index.html
gcloud init
gsutil ls # shows buckets
gsutil ls gs://mybucket
# Upload to bucket
gsutil cp localfile gs://mybucket # copy
gsutil rsync -r -x ".git/*" . gs://BUCKET_NAME/
# -r recursive copy
# -x exclude files in the .git folder
# Set bucket to public-read
gsutil -m acl set -R -a public-read gs://BUCKET_NAME
# -m multithreaded
# -R recursive
# -a all object versions
# Version control
## Get status of versions ("Suspended" = not enabled, "Enabled")
gsutil versioning get gs://BUCKET_NAME/
gsutil versioning set on gs://BUCKET_NAME/
## List versions of the file
gsutil ls -al gs://BUCKET_NAME/myfile
$ gsutil ls -la gs://adam1/test.txt
3 2019-10-15T14:15:24Z gs://adam1/test.txt#1571148924339107 metageneration=1
6 2019-10-15T14:17:25Z gs://adam1/test.txt#1571149045157341 metageneration=1
TOTAL: 2 objects, 9 bytes (9 B)
Upload options - web console - gsutil (uploads multiple files in parallel, resumable in case of network interruption) - JSON API
BigQuery Data Transfer Service
Bucket-to-bucket scheduled transfers - create a transfer job into a bucket from another storage bucket, an S3 bucket, ...
Charged for network egress
costs of delivering content from a bucket vs. the CDN https://cloud.google.com/storage/pricing#network-pricing https://cloud.google.com/cdn/pricing
Prefer Cloud Storage, unless you need an actual file system!
must be attached to a VM
compute engine - create a VM
SSH in, install the Apache HTTP server, cd /var/www/html
under disks - has a “standard persistent disk”, in same zone as the VM
snapshot a disk
create disk based on snapshot
uses:
size from min 10GB up to 64TB
can add/create more disks - SSD or standard magnetic
DO EXERCISE https://labs.roitraining.com/labs/793a-gcp03-disks-and-snapshots/index.html#0
Key management
- Google-managed - no config required - KEK rotated frequently - DEK rotation less often - data re-encryption required at least once every 5 years
- Customer-managed - manage via Google Cloud KMS
- Customer-supplied - managed outside of Google - don't lose the key! - better to encrypt the file before uploading it, rather than using a customer-supplied key!
https://cloud.google.com/kms/docs/quickstart - create a keyring and add keys to it - symmetric, asymmetric - rotation period, auto generated or own https://cloud.google.com/kms/docs/algorithms#key_purposes
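A minimal gcloud sketch of the quickstart steps (keyring, key, and file names are made up):
# Create a keyring and a symmetric encryption key, then encrypt a local file
gcloud kms keyrings create my-keyring --location=global
gcloud kms keys create my-key --keyring=my-keyring --location=global --purpose=encryption
gcloud kms encrypt --keyring=my-keyring --location=global --key=my-key \
    --plaintext-file=secret.txt --ciphertext-file=secret.txt.enc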
Bigtable - ~$1400 per month just for the instance in us-central1 - NoSQL, fully managed, not transactional - like Cassandra / HBase - a wide-column store
number of nodes dictates network capacity
Config: - storage type : SSD, HDD (slower, cheaper) - cluster id - region & zone - choose number of nodes - min of 3 for prod - add a replicated cluster in a different region/zone
Performance - Based on current node count and storage type. - Adding a cluster increases read throughput but not write throughput. When writing, Cloud Bigtable uses the additional throughput for replication.
Price - 1 cluster of 3 nodes AUD $1922/month - plus 1TB HDD $35/month
Schemas
- tables - row key - columns can be grouped into column families
- store all data about an entity in a single row
- updates to a single row are implicitly transactional - transactions are only within one row - multi-row transactions are not supported
- no indexes (unless you create your own table to act as an index, or design a better row key...)
EXAM - question on how to design row keys (see the cbt sketch after the link below)
- a counter is a bad row key - all new rows go to the last tablet - use a hash?
- a timestamp is a bad row key - also hits the last tablet
- a GUID is good - random and distributed - but not helpful for a search based on an attribute of the data
- design based on the likely query predicate - good example: highway number - mile marker / sensor number - timestamp
https://cloud.google.com/bigtable/docs/schema-design#row-keys
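A small sketch of the highway/sensor/timestamp row-key idea using the cbt CLI (instance, table, and values are made up; assumes cbt is already configured with a project):
# Table with one column family; row key = highway#milemarker#timestamp
cbt -instance my-instance createtable sensor-readings
cbt -instance my-instance createfamily sensor-readings data
cbt -instance my-instance set sensor-readings "I90#mile042#20191015T141500" data:speed=67
# A prefix scan on the row key answers the likely query (all readings for one sensor)
cbt -instance my-instance read sensor-readings prefix="I90#mile042"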
Tablets - a table is split into multiple tablets to store its data - tablets are distributed over different nodes and disks
Use Bigtable when you need a place to store a huge amount of data (100TB+) and you need extremely fast random access to individual records.
Replication guidance
Replication for Cloud Bigtable copies your data across multiple regions, enabling you to isolate workloads and increase the availability and durability of your data. Learn more https://cloud.google.com/bigtable/help/replication/overview
Replication cannot be paused. All writes are automatically replicated to all clusters.
Replication uses a primary-primary configuration with eventual consistency.
Replicating writes between clusters requires additional CPU for each cluster in the instance.
Provision enough nodes for each cluster to keep the cluster's average CPU utilization under the recommended threshold for your current number of clusters and app profile policy. For instances with 1 cluster or multiple clusters and single-cluster routing, keep average CPU utilization under 70%. For instances with 2 clusters and multi-cluster routing, keep CPU utilization under 35%. For instances with more than 2 clusters and multi-cluster routing, see additional configuration guidance.
Routing and other client access patterns are configured by the application profile. If needed, you can create and customize additional profiles.
Datastore has been migrated to Firestore - NoSQL, ACID transactional
indexes created by default on every property - so there are “secondary indexes” and more!
Firestore was purchased, Google liked it better than Datastore, but wanted to consolidate them
All datastore was migrated to Firestore in Datastore mode.
Runs on top of a shared instance of Spanner!! Same availability!
Config:
- Mode
  - Native mode (use for all new development) - all features, offline support and real-time sync - the API is like MongoDB
  - Datastore mode
- Location
  - multi-region - eur, nam - 99.999% (same as Spanner) - ~17c / GB-month + reads/writes (1GB free)
  - single region - e.g. syd - 99.99% (same as Spanner) - ~8c / GB-month + reads/writes (1GB free)
1GB storage + 50k entity reads + 50k small ops : DAILY FREE tier!! https://cloud.google.com/datastore/pricing
so if small website data, then Firestore is the cheapest option!!
Relational -> Datastore terminology:
- tables -> kinds
- rows/records -> entities
- fields -> properties
- primary key -> key
- relationship -> entity group (for faster reads on related data)
- primary/foreign keys -> ancestor paths (composite key points back to parent)
Firestore - uses the Firebase API - like Mongo - not supported with older App Engine runtimes - supports offline mobile devices and sync
Firestore -> Datastore terminology:
- collections -> kinds
- documents -> entities
- key-value pairs -> properties
- document name -> key
- sub-documents -> entity groups
https://cloud.google.com/firestore/docs/quickstart-servers
DO THE FIRESTORE EXERCISE EXAM - https://quiz.roitraining.com/de-quiz4.html
Memorystore - managed, in-memory Redis caching (key-value) - SET/GET - NoSQL - scales from 1GB to 300GB instances
Config:
basic tier - single instance - $35 / month
standard tier - has failover replica in another zone - 99.9% availability - $46 / month
choose RAM 1GB - 300GB
open-source
tutorial: try.redis.io (not needed for exam)
secure with an internal IP address only - available only from within same VPC - IAM roles
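A minimal gcloud sketch of creating an instance with this config (name, region, and size are made up):
# Basic tier, 1GB Redis instance, reachable only via its internal IP in the chosen VPC
gcloud redis instances create my-cache --size=1 --region=us-central1 \
    --tier=basic --network=default
# Shows the internal IP/port to connect to from a VM in the same VPC
gcloud redis instances describe my-cache --region=us-central1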
EXAM! Using the Price Calculator, estimate the cost of running a 10TB database using Cloud SQL, Spanner, Datastore, and Bigtable https://cloud.google.com/products/calculator/ Do your best to estimate the values for each product Use the documentation to help with your estimates Compare with https://cloud.google.com/products/calculator/#id=a0518a39-ae29-4bd6-868b-162dc3e6fe64
In order of expense:
- Spanner - $1600 / month for 3 nodes
- Bigtable - $1400 / month for a cluster of 3 nodes
- Memorystore - $40 / month
- Cloud SQL - min $9 / month for 1 machine instance
- Firestore/Datastore - 1GB storage free tier
- Cloud Storage - 5GB US-regional free per month
- BigQuery - 10GB free storage, then 2c/GB-month for the first 3 months, then 1c; 10TB left for a year = $1500
Uses: - dynamic web apps - load balancer -> compute engine instances in different zones -> Master and replica Cloud SQL in different zones
Managed service for relational dbs in GCP
Managed compute instances - you choose VM machine type and disk
Instance of: - MySQL - PostgreSQL - SQL Server (in beta/preview - will be more expensive due to licensing)
Default user password is required
= Connectivity
Public IP by default - but there is a firewall preventing internet access - you need to authorise a network / open the firewall to that IP address range
Private IP - only available within the GCP network
= Machine type & Storage
- choose a machine with enough memory to hold your largest table
- shared-core machines, standard machines, high-memory machines
- storage type - SSD or HDD
- storage capacity from 10GB to 30TB (30720 GB) - can't be decreased later
- start small and select "Enable automatic storage increase"
- increased capacity => more disk throughput (MB/s)
Automated backups & HA - choose a backup window
Availability - single zone (1 DB server) - high availability (regional) - auto failover to another zone in the same region
Maintenance schedule - maintenance window - "Any window" (fine for HA), otherwise select a day of week and time - maintenance timing - any
Know the Advanced Options FOR THE EXAM
Standard installations of MySQL and PostgreSQL - no special config performed
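A hedged gcloud sketch of creating an instance with the options above (name, tier, region, and sizes are made up):
# MySQL instance with SSD storage, auto storage increase, regional HA and a backup window
gcloud sql instances create my-instance \
    --database-version=MYSQL_5_7 --tier=db-n1-standard-1 --region=us-central1 \
    --storage-type=SSD --storage-size=10 --storage-auto-increase \
    --availability-type=REGIONAL --backup-start-time=03:00
# Default user password is required before connecting
gcloud sql users set-password root --host=% --instance=my-instance --password=CHANGE_ME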
https://cloud.google.com/sql/docs/
https://cloud.google.com/sql/docs/mysql/quickstart
https://cloud.google.com/sql/pricing#2nd-gen-pricing
https://cloud.google.com/spanner/docs/ https://cloud.google.com/spanner/docs/best-practice-list
Fully managed, relational, strongly consistent, ACID transactions
Scales globally, distributed horizontally across regions
Number of nodes determines capacity
Designed for extremely large relational databases
expensive! Min of US $1600 / month plus storage and usage costs!
Single region Sydney: $1.22 per node-hour * 3 nodes for prod + $0.405 per GB-Month + general network egress (e.g. 19c / GB)
us-central1: $0.90 per node-hour + $0.30 per GB-Month
Multi-regional nam-eur-asia1: $9 per node-hour + $0.90 / GB-month
nam3 (Northern Virginia / South Carolina) $3 per node-hour + $0.50 / GB-month
Regional 99.99% availability
Multi-regional 99.999% availability
Config:
- name
- regional / multi-regional - each configuration has a different number of replicas and availability - different sets of read-write replicas - nam = North America, eur = Europe, asia = Asia
USD $42k per month for multi-regional - nam-eur-asia1
number of nodes
For optimal performance in this configuration, we recommend provisioning enough nodes to keep high priority CPU utilization under 45%.
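A minimal gcloud sketch of creating a regional instance and a database with the config above (names are made up):
gcloud spanner instances create my-spanner-instance \
    --config=regional-us-central1 --nodes=3 --description="demo instance"
gcloud spanner databases create my-db --instance=my-spanner-instance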
uses atomic clocks attached to the servers to manage transactions
ANSI SQL 2011
17c per GB per month on the cluster
Hadoop cluster
Config
- name
- region & zone
- cluster mode:
  - Single node - 1 master, 0 workers
  - Standard - 1 master, N workers (typical for production)
  - High Availability - 3 masters, N workers
- Master node - VM settings: compute size (n1-standard-?), HDD size (default 500GB, min 15GB)
- Worker nodes - VM settings: compute size (n1-standard-?), HDD size (default 500GB, min 15GB) - number of nodes
- Advanced - Component Gateway - enable access to the web interfaces of default and selected optional components of the cluster - gives access to Web Interfaces (tab in cluster details) for YARN Resource Manager, HDFS NameNode, MapReduce job history
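A hedged sketch of the same config via gcloud (machine types and disk sizes are made up):
gcloud dataproc clusters create my-cluster \
    --region=us-central1 --zone=us-central1-a \
    --master-machine-type=n1-standard-4 --master-boot-disk-size=100 \
    --num-workers=2 --worker-machine-type=n1-standard-4 --worker-boot-disk-size=100 \
    --enable-component-gateway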
Hive & SparkSQL & Presto all allow you to submit SQL to the master - Hive to Hadoop - SparkSQL to Spark - Dataproc is both Hadoop and Spark!
Google would say don’t use Presto.
EXAM - reduce costs with preemptible workers - store data in Cloud Storage - use Dataproc to easily migrate legacy Hadoop jobs...
Dataproc creates an HDFS cluster, but don’t use it for long-lived storage
Store the data to analyze in Google Cloud Storage Cloud Storage is cheaper (HDFS cluster created on Persistent Disks) Only pay for what you use, not for what you allocate
Separating storage and compute allows the cluster to be disposable Can size the cluster for specific jobs Delete the cluster as soon as possible Only pay for it while it is working Can recreate the cluster in a couple minutes
Can run existing Hadoop and Spark jobs on Dataproc Use Cloud Storage not HDFS so the Dataproc cluster can be deleted without deleting the data Make the attached disks small
Move HBase workloads to Bigtable to reduce administration
Script creation of Dataproc clusters and jobs Delete the cluster when the jobs are completed to reduce cost Machines are billed in 1-minute increments with a 10-minute minimum Consider using preemptible instances for some of the workers
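A minimal sketch of scripting a job against a disposable cluster (bucket, cluster, and job names are made up):
# Submit a PySpark job that reads from and writes to Cloud Storage, then delete the cluster
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
    --cluster=my-cluster --region=us-central1 \
    -- gs://my-bucket/input/ gs://my-bucket/output/
gcloud dataproc clusters delete my-cluster --region=us-central1 --quiet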
** Choose BigQuery over Dataproc for greenfield **
Could run Jupyter on the Dataproc cluster
Google V1 - GFS distributed file system + MapReduce became HDFS and Hadoop and given to Apache
Google V2 - Colossus replaced GFS as Cloud Storage - allows data to be stored off the cluster and brings more flexibility to cluster management
- Dremel replaced MapReduce as BigQuery (similar to Hive, Spark SQL, Presto)
- Master parses SQL statement and turns it into a job
- SQL is well suited to MapReduce functionality
Pig is like Python compared with Java - less verbose
Hive allows SQL to be submitted to a Hadoop cluster - like BigQuery
Apache Spark - effectively Hadoop v2, doing more in-memory to optimise Hadoop
typically the answer for anything mentioning "data warehousing"! Move Hive workloads to BigQuery
Purpose - Data warehouse - Data analytics - ML
Uses:
- data warehousing - get data into Cloud Storage first -> BigQuery; or Cloud Storage -> Dataflow for ETL -> BigQuery
- log processing - Cloud Logging log collection, export to Cloud Storage once per day -> Dataflow -> BigQuery; or Cloud Logging log collection, stream to Pub/Sub -> Dataflow -> BigQuery
- shopping cart analysis - Dataflow / Cloud Storage -> Dataproc / Dataflow -> BigQuery or application (App Engine / Compute Engine / Container Engine)
BigQuery has its own storage system Most efficient storage when running BigQuery queries
Projects contain datasets, which contain tables There is no limit to the number of tables in a dataset There is no limit to the number of rows in a table
Each field in a table is stored separately Makes querying more efficient because only fields in a query are read
All data is encrypted by default All data is compressed for faster processing
Tables must have a schema which defines field names and data types Schemas support nested, hierarchical data (Array) Schemas support repeated fields For example, an Orders table can have a field called Details which is an array of records which provides details about each order
Repeated, nested fields allow querying parent-child relationships without needing to join two tables Joins are expensive in BigQuery since there are no indexes
Storage Pricing - Storage 2c per GB-month for first 3 months, then 1c per GB-month - 10TB left for a year = $1500
ANSI SQL 2011 for SELECT queries
Can write user-defined functions to manipulate data in SQL or JavaScript
Can query from a number of data sources: - BigQuery Storage - Google Cloud Storage - Google Bigtable - Google Drive
Completely NoOps
- No need to provision anything
- No need to tune queries
Query Pricing
Queries are charged one of two ways: on-demand and flat-rate - EXAM
On demand at $5/TB of data processed with 1TB free per month
1TB = $5 10TB = $50 1000TB (1 Petabyte) = $5000
With flat-rate pricing, you pre-purchase BigQuery capacity - Capacity is measured in “slots”, a unit of processing in BigQuery - Run as many queries as you can, but you will never go over your purchased slots
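To estimate an on-demand query's cost before running it, a dry run reports the bytes that would be processed (multiply by $5/TB); the public dataset below is just an example:
bq query --use_legacy_sql=false --dry_run \
    'SELECT name, SUM(number) AS total
     FROM `bigquery-public-data.usa_names.usa_1910_2013`
     GROUP BY name'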
EXAM - multiple questions on correct BigQuery queries - which is the right one
There are two dialects of BigQuery SQL, Legacy and Standard Legacy was the original, replaced in 2016
Standard SQL is ANSI 2011 SQL compliant Very minor differences exist due to the platform
Standard SQL includes data manipulation language (DML) statements INSERT, UPDATE, DELETE Strict quotas apply to DML statements
STRUCT creates a composite field
ARRAY_AGG creates an array
UNNEST flattens an array of objects so they are queryable
ARRAY_LENGTH(col) - counts the elements of an array
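A small sketch that combines these functions over an invented orders structure:
bq query --use_legacy_sql=false '
WITH items AS (
  SELECT 1 AS order_id, "pen" AS item, 2 AS qty UNION ALL
  SELECT 1, "book", 1
),
orders AS (
  SELECT order_id, ARRAY_AGG(STRUCT(item, qty)) AS details
  FROM items GROUP BY order_id
)
SELECT order_id, ARRAY_LENGTH(details) AS num_items, d.item, d.qty
FROM orders, UNNEST(details) AS d'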
Access is granted to BigQuery using IAM Members and Roles
** Table access is granted at the Dataset level All Tables within a Dataset share the same permissions For public Datasets, grant Viewer role to allAuthenticatedUsers
Members at minimum need Job User role to run queries
BigQuery Roles
- Admin
- Data Editor - read/write data
- Data Owner - read/write data, plus grant access to other users and groups by setting IAM policies
- Data Viewer
- User - run jobs, create datasets, list tables, save queries; no default access to data
- Job User - create and run jobs, but not access data - the least set of permissions possible
No indexes
Denormalize parent-child relationships Store child records as repeated records with the parent row Allows querying related data without a join
Queries that were already run are cached Data returned from the cache is free
For very large tables, create smaller temp tables when possible
Partition tables when data is accumulated on a regular basis For example, daily logs, daily sales, etc. Can specify the data range of partitions to query, avoiding a table scan
Don’t group by fields with a very large number of different values
Prefer built-in functions to UDFs if possible
Partition by ingestion time - every partition is stored as a separate file/table - a pseudo-column _PARTITIONTIME is created - filter with WHERE _PARTITIONTIME = ... or BETWEEN ...
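A sketch of creating an ingestion-time partitioned table and querying a single day of it (dataset, table, and schema are made up):
# Create a table partitioned by ingestion day
bq mk --table --time_partitioning_type=DAY mydataset.daily_logs message:STRING
# Filtering on the pseudo-column avoids a full table scan
bq query --use_legacy_sql=false \
    'SELECT COUNT(*) FROM mydataset.daily_logs WHERE _PARTITIONTIME = TIMESTAMP("2019-10-15")'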
BigQuery can analyze data directly from: - Bigtable - Cloud Storage - Google Drive
Must define a table schema for the external source
Not as efficient as BigQuery native storage Native storage puts fields in separate tables so queries only have to scan the fields in the query, not the whole table
Useful for ETL and denormalization jobs Read data from external source, manipulate it, and load it into BigQuery
See: https://cloud.google.com/bigquery/external-data-sources
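A possible flow for querying a CSV in Cloud Storage as an external table (bucket, file, and schema are invented):
# Build a table definition for the external file, then create a table that points at it
bq mkdef --source_format=CSV gs://my-bucket/sales.csv 'order_id:INTEGER,amount:FLOAT' > sales_def.json
bq mk --external_table_definition=sales_def.json mydataset.external_sales
bq query --use_legacy_sql=false 'SELECT SUM(amount) FROM mydataset.external_sales'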
Fully managed cluster of VM instances is created to start a job
Use Cases: Not for ETL on 1GB data - not enough data to make it worthwhile - takes longer to start the cluster than run the job! Meant for Terabytes of ETL! Meant for windowing/aggregation/complex event handling of streaming data
Perform a set of actions in a chain Data from one or more sources is read into the pipeline Actions are performed on the data to manipulate or transform it The manipulated results are sent as output from the pipeline
Actions within a data pipeline can run at the same time (concurrently)
At scale, multiple machines can participate to get the pipeline done faster
MapReduce jobs are examples of pipelines at scale
- multiple nodes read data from disk and perform an initial map step
- data is then organized by key (the shuffle step)
- keyed data is processed separately in parallel (the reduce step)
Dataflow is the most optimised way of running Apache Beam on GCP
Batch data flows process big chunks of data at set intervals Analyzing daily logs Importing monthly sales Periodic data conversions
Streaming data flows process data as it is accumulated Analyzing traffic to determine the quickest route What tweets are trending right now What products are selling today
Dataflow pulls together a number of GCP services to run data flows It is the job of the Dataflow service to optimize execution
Cloud Storage is used as a staging area for data flow code Can also be used for data input and output
BigQuery tables can be used for input and output BigQuery is frequently the preferred tool to analyze data flow output
Cloud Compute instances are used to execute data flows The Dataflow service determines how many instances are required Google’s high-speed network is used to move data around
Pub/Sub is used to provide streaming data flows
NOTE: for streaming, there will be at least one VM that you are paying for!! Not the cheapest way to do streaming! (cheapest is PubSub -> Cloud Function or AppEngine)
BUT Dataflow has windowing mechanisms, aggregation, complex event handling capability - which is the reason why you would use it for streaming - anomaly detection - use ML to create notification if number of messages in a window is larger/smaller than usual
Dataflow has built-in support for three types of windows - it shuffles arriving messages into the correct window based on event time
- Fixed time windows are the simplest - each window is given an interval
- Sliding time windows have an interval and a period - the interval defines how long a window collects data for; the period defines how often a new window starts
- Session windows define windows on areas of concentrated data - uses a key to combine data into groups - like sessions in a web application
Time-based windows can be based on either event time or processing time.
Pub/Sub is a fully managed, massively scalable messaging service
It allows messages to be sent between independent applications
Can scale to millions of messages per second
Pub/Sub messages can be sent and received via HTTP(S)
Pub/Sub supports multiple senders and receivers simultaneously
Pub/Sub is a global service Messages are copied to multiple zones for greater fault tolerance Uses dedicated resources in every region for fast delivery worldwide
Pub/Sub is secure - all messages are encrypted at rest and in transit
Topic names are in the form:
projects/[PROJECT_ID]/topics/[TOPIC_NAME]
Config - Delivery type
- Pull - e.g. Dataflow jobs, a Kubernetes service; the subscriber calls pull() then acknowledge()
- Push - an HTTPS web service / Cloud Function / App Engine standard environment app (scales to zero); ACK is implied by a 200 OK response code
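A minimal gcloud sketch of a topic with one pull and one push subscription (names and the push endpoint are made up):
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-pull-sub --topic=my-topic
gcloud pubsub subscriptions create my-push-sub --topic=my-topic \
    --push-endpoint=https://my-app.example.com/pubsub-handler
# Publish a message, then pull and acknowledge it from the pull subscription
gcloud pubsub topics publish my-topic --message="hello"
gcloud pubsub subscriptions pull my-pull-sub --auto-ack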
Pricing: - charged by GB of message data
Event time is when something actually occurred - e.g. the time an order is placed - this is when the message is published to Pub/Sub
Process time is when the system observes the event - when the subscriber receives the Pub/Sub message
Obviously, process time is always after event time - usually a short period of time, but sometimes a system problem will delay the process time
The difference between event and process times can vary significantly - e.g. the event occurs in a mobile application while the user is on an airplane
Visual tool for cleaning and manipulating data
For massive datasets only (it runs on Dataflow - think Terabytes) Serverless on top of Dataflow
A service that creates the Apache Airflow server for you - orchestrate workloads across GCP, on-premises & other clouds
Create environment - name - node count - machine type - service account
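A hedged sketch of creating an environment and uploading a DAG from the CLI (names, location, and sizes are made up):
gcloud composer environments create my-composer-env \
    --location=us-central1 --node-count=3 --machine-type=n1-standard-1
# Copy a DAG file into the environment's DAGs folder in Cloud Storage
gcloud composer environments storage dags import \
    --environment=my-composer-env --location=us-central1 --source=my_dag.py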
Navigation -> Big Data -> Composer
Apache Airflow - Workflow orchestration engine - python
AirFlow UI - history of jobs
Workflow - coded in Python - start date - job valid from - the steps form a DAG - each step is a Python or Bash operator - specify the order of steps in Python: step1 >> step2 >> step3
Built-in connectors for many GCP services
If there is a bug, just overwrite the DAG file in storage
Standard environment (scales to zero)
Great for a cheap push of streaming data from Pub/Sub
40c per million requests; 2 million requests free per month + CPU usage charges
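A possible deployment of a function that is pushed each Pub/Sub message (function name, runtime, and topic are made up):
# Deploys hello_pubsub from the current directory; invoked once per message on my-topic
gcloud functions deploy hello_pubsub --runtime=python37 --trigger-topic=my-topic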
API to help classify and redact sensitive data Helps customers meet their compliance obligations Works with image or text data
Data can be in GCP, other clouds, or on-prem Built-in support for BigQuery, Datastore, and GCS
90-plus built-in detectors for common sensitive data items E.g., credit card numbers
Detectors can be customized to classify/redact new data items E.g., social security numbers in a particular country’s format
Managed Jupyter service
Interactive tool for data analysis, machine learning, and many other tasks
Based on Jupyter, an open-source project for creating iPython Notebooks
Supports many languages: Python, JavaScript, Shell scripts, HTML, SQL, etc.
Integrated with GCP, so access to other services like BigQuery are simple
Integrates with Git to enable collaboration and sharing notebooks
Runs in a Compute Engine Virtual Machine No extra charge for Datalab beyond the machine cost
Manage Datalab instances using the Google Cloud SDK in Cloud Shell
To create a Datalab instance: datalab create instance-name
To stop an instance without deleting it: datalab stop instance-name
To restart a stopped instance: datalab connect instance-name
To delete an instance: datalab delete instance-name
Can pass in options to choose the machine type with the create command
For more options and details on using the SDK, see https://cloud.google.com/datalab/docs/how-to/lifecycle
Creates a VM. Stop the VM and there is no compute cost, but still a storage cost.
Advantage is that - Google Cloud SDK is already installed - Already authenticated
%%bq query
SELECT ...
%%javascript
function blah() { }
datastudio.google.com
SaaS
Business reports / BI / charts & graphing
Free
In EXAM - ML basics - basic terminology - basic model - GCP tools to help you
Target / Label Features / inputs Example - a row of data of inputs and label/target
Feature Tuning 1. gather data 2. Split data into train/test 3. Train / draw line
RMSE - root mean squared error - the square root of the mean of the squared errors. MAE - mean absolute error - uses the absolute value instead of squaring
Fit evaluate predict
Gradient Descent - find the best-fit line / function / algorithm - if RMSE goes down, move in the same direction - if RMSE goes up, move in the other direction
Different features are more important than others - hence each has a weighting Sum of weighted features gives the prediction
Algorithms - Linear Regression - value along a line - Classification - 2 or more categories - Single layer neural models - Neural networks - collection of layers
No code in the EXAM
Supports Python, Go, JavaScript
Navigation -> AI Platform -> Jobs
Cloud-based service used to build machine learning models Used by Google to build their own models Fully-managed NoOps service Supports training on CPUs, GPUs, and TPUs Can be used to deploy models as services Command-line API for submitting jobs, deploying models, and making predictions
gcloud ml-engine jobs submit training my_job \
    --module-name trainer.task \
    --staging-bucket gs://my-bucket \
    --package-path /my/code/path/trainer \
    --packages additional-dep1.tar.gz,dep2.whl
gcloud ml-engine jobs submit prediction JOB \
    --data-format=DATA_FORMAT \
    --input-paths=INPUT_PATH,[INPUT_PATH,…] \
    --output-path=OUTPUT_PATH \
    --region=REGION
gcloud ml-engine models create MODEL [--enable-logging] [--regions=REGION,[REGION,…]] [GCLOUD_WIDE_FLAG …]
Allows you to specify the number and type of training machines
NoOps - can automate hyperparameter tuning
Can deploy multiple versions of the same model concurrently
Vision API - face - objects - safe search detection - adult, racy, spoof, violence
NLP - entities, syntax, categories
Translate API
Video Intelligence API - indexing
Speech API - recognition
AutoML Vision, - Image classification - Object detection
NLP
Translation
Uses Transfer Learning - extend a pre-trained model - allows you to require less data
Provides a simple user interface for: creating datasets, training and evaluating models, using your trained models to make predictions
Trains model and gives a REST API to call the trained model.
Linear regression Binary logistic regression Multiclass logistic regression for classification
Must SELECT / mark one field as the label
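A hedged sketch of training a linear regression model in BigQuery ML (dataset and columns are invented):
bq query --use_legacy_sql=false '
CREATE OR REPLACE MODEL mydataset.fare_model
OPTIONS(model_type="linear_reg", input_label_cols=["fare"]) AS
SELECT trip_miles, trip_minutes, fare
FROM mydataset.taxi_trips'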
global load balancer -> CDN + storage bucket single region (website use case)
spanner multi-regional access
NoSQL - horizontally scalable - often schemaless - varying support for SQL - reporting can be harder
key-value stores - Redis - SimpleDB
Document stores - Data is stored in some standard format like XML or JSON, BSON - Nested and hierarchical data can be stored together - i.e. a single read instead of reading from multiple tables - MongoDB, CouchDB, and DynamoDB are examples - Firestore
Wide-column stores - Key identifies a row in a table - Column data types and number of columns can be different within each row - All info about an entity is within the same row - Cassandra and HBase are examples - Bigtable
NoSQL is NOT blob storage!
Availability is the percentage of time the data can be accessed - Achieved by deploying services to multiple zones and/or regions Durability defines the likelihood of losing data because of a hardware failure - Achieved by writing data to multiple physical disks; The more disks, the higher the durability
High availability - customer-visible, real-time
Cost and durability - backups, archives - sacrifice availability or performance to make it cheaper
Cost is the only consideration - cache or temporary data store that can be destroyed and recreated from other sources - typically in memory - want it to be cheap
Transactional consistency – When a transaction completes, all operations must be successful or all are rolled back The data must conform to all rules specified by the database The state of data is known by all nodes in a distributed system e.g. financial payment, price of stock
Eventual consistency – After data is updated, the system guarantees that all copies of the data in a distributed system will “eventually” be the same It is possible that requests to different nodes will return different results after an update e.g. product reviews, social networking status updates
Strong consistency – All nodes in a distributed system that have received the same update will be in the same state once the transaction completes - all nodes would thus return the same result to a query
Volume – there is a lot of data Need many drives to store the data
Velocity – if there is a lot, then you must be collecting it at a fast rate Need to be able to write to drives very quickly Need to be able to get the data back very quickly
Variety – the data is coming from many different sources: web pages, text files, logs, PDFs, databases, etc.
Practice exam 20 questions: https://quiz.roitraining.com/de-practice-exam.html