Google Cloud Platform (GCP) Cloud Dataflow Product Overview
Google Cloud Platform (GCP)
A fully managed data processing service, built on Apache Beam
GCP Cloud Dataflow Doco
Can export a Cloud Dataprep flow as a Dataflow template
Apache Beam
A portable data processing programming model
Features
Open source
A UNIFIED programming model for processing both batch and streaming data
Pipelines are PORTABLE - can be executed on multiple execution environments
EXTENSIBLE - new SDKs, IO connectors and transformation libraries can be created and shared
Concepts
Pipelines - a set of steps/transforms; starts with a read from a source; ends with a write to a sink
Languages supported - Java, Python or Go
Step / Transform - elastically scaled; applied to a PCollection, producing another PCollection; give each transform a unique name within the pipeline - transforms can be reused, and the name appears as a step in the console UI
In Java - p.apply("step name", TextIO.Read.from("gs://.../input-*.csv.gz")).apply(new TransformClass()).apply(TextIO.Write.to("gs://...")); p.run();
In Python - (pipeline | 'step name' >> beam.io.ReadFromText('gs://.../input-*.csv.gz') | beam.FlatMap(lambda line: myfunction(line)) | beam.io.WriteToText('gs://...')); pipeline.run() - see the fuller runnable sketch after this Concepts list
PCollection - a distributed, parallel collection of elements; can be bounded (batch) or unbounded (streaming); not necessarily held in memory
Sources
Sinks
SDK - provides a library of transformations and data connectors to sources and sinks
Model - a portable representation that is created from your code
Runner - passes off your model to an execution environment, to be run by an engine
Engine - e.g. Cloud Dataflow
Workers
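A minimal runnable sketch of these concepts in Python (pipeline, named transforms, PCollections, a source and a sink); the bucket paths and parse_line function are placeholders:
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Placeholder per-element logic; replace with real parsing
        return line.strip().split(',')

    options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc. to run on GCP
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input-*.csv')   # source -> PCollection
         | 'Parse' >> beam.Map(parse_line)                                # transform -> new PCollection
         | 'Format' >> beam.Map(lambda fields: ','.join(fields))
         | 'Write' >> beam.io.WriteToText('gs://my-bucket/output'))       # sink
    # leaving the with-block runs the pipeline and waits for it to finish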
Features
No-ops - little maintenance overhead
Built on reliable Google infrastructure
Auto-scaling of workers
Integrated with Stackdriver for monitoring and alerting
No lock-in because Apache Beam can be run elsewhere
Serverless - managed resource provisioning, performance tuning, pipeline reliability
Auto-heal in the event of faults with workers
Re-balance automatically to best utilise workers
Dataflow Templates
Can run pipeline locally or on cloud - in Python or Java
To associate a timestamp with inputs, Dataflow automatically uses the Pub/Sub message publish timestamp; for other sources you need to assign one yourself with c.outputWithTimestamp(...)
Ingesting data into a pipeline - from wildcarded files in file system, Cloud Storage, BigQuery (returns a TableRow), Pub/Sub
Writing data out - to file system, Cloud Storage, BigQuery, Pub/Sub
Can prevent sharding with .withoutSharding() - but not recommended, as it forces all writing onto a single machine; OK only for very small outputs (see the Python sketch below)
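A minimal Python sketch of the ingest/write notes above, with placeholder bucket paths; num_shards=1 plays the same role as the Java .withoutSharding() - a single output file, so only suitable for small outputs:
    import apache_beam as beam

    with beam.Pipeline() as p:
        lines = p | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/input-*.csv.gz')  # wildcarded read
        lines | 'WriteOut' >> beam.io.WriteToText('gs://my-bucket/output',
                                                  file_name_suffix='.txt',
                                                  num_shards=1)  # force one shard - small outputs only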
Actions
Step 1 - Receive a job
Step 2 - optimise the execution graph of the model to remove inefficiencies
Step 3 - schedule and distribute work to workers, scaling as necessary
MapReduce
Tip: Prefer Combine over GroupByKey - Combine knows the aggregation is commutative and associative, so partial results can be computed in parallel on many workers, whereas GroupByKey funnels all values for a key through a single worker
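A small Python sketch contrasting the two approaches on made-up sales data; CombinePerKey lets partial sums be computed on many workers before merging, while GroupByKey + sum shuffles every value for a key to one worker first:
    import apache_beam as beam

    sales = [('uk', 10.0), ('uk', 20.0), ('us', 5.0)]  # made-up (country, amount) pairs

    with beam.Pipeline() as p:
        kv = p | 'Create' >> beam.Create(sales)

        # Preferred: the combiner is commutative/associative, so partial sums run in parallel
        kv | 'SumCombine' >> beam.CombinePerKey(sum) | 'PrintCombine' >> beam.Map(print)

        # Works, but all values for a key land on a single worker before summing
        (kv | 'Group' >> beam.GroupByKey()
            | 'SumGrouped' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
            | 'PrintGrouped' >> beam.Map(print))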
Approach to streaming
Extract data
Apply a time window
Group by or Combine
Apply aggregate function
Windowing - group by time: assign event timestamps with c.outputWithTimestamp() instead of c.output(), then:
Fixed Windows: .apply("window", Window.into(FixedWindows.of(Duration.standardMinutes(2))))
Sliding Windows: .apply("window", Window.into(SlidingWindows.of(Duration.standardMinutes(2)).every(Duration.standardSeconds(20))))
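The two lines above are the Java form; an equivalent Python sketch, with made-up elements and hard-coded epoch timestamps standing in for real event times:
    import apache_beam as beam
    from apache_beam.transforms import window

    events = [('a', 0), ('b', 30), ('c', 150)]  # made-up (element, unix_timestamp) pairs

    with beam.Pipeline() as p:
        (p
         | 'Create' >> beam.Create(events)
         # attach the event timestamp - the counterpart of c.outputWithTimestamp()
         | 'Stamp' >> beam.Map(lambda e: window.TimestampedValue(e[0], e[1]))
         # fixed 2-minute windows; window.SlidingWindows(120, 20) gives the sliding form
         | 'Window' >> beam.WindowInto(window.FixedWindows(2 * 60))
         | 'Pair' >> beam.Map(lambda x: (x, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Print' >> beam.Map(print))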
In Python
Use beam.Map( lambda word: (word, len(word)) ) for 1:1 relationships between input and output - here each word maps to one (word, length) tuple
Use beam.FlatMap( lambda line: my_generator(line, searchTerm) ) for non-1:1 relationships between input and output, typically with a generator that yields zero or more values per input
GroupBy and Combine
Combine - to aggregate - Sum, Mean, etc
GroupByKey - groups a PCollection of (key, value) tuples by key # grouped = cityAndZipCodes | beam.GroupByKey()
salesAmounts | beam.CombineGlobally(sum) to combine all items in a collection into a single aggregation/sum
salesRecords | beam.CombinePerKey(sum) to combine (key, value) tuples by key and aggregate/sum the values
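A runnable sketch of the Map / FlatMap / Combine points above; the word list and search term are made up:
    import apache_beam as beam

    def matches(line, search_term):
        # generator: yields zero or more outputs per input, hence FlatMap
        if search_term in line:
            yield line

    with beam.Pipeline() as p:
        words = p | 'Create' >> beam.Create(['java', 'python', 'go'])

        # 1:1 - each word becomes one (word, length) tuple
        (words | 'Lengths' >> beam.Map(lambda word: (word, len(word)))
               | 'SumPerKey' >> beam.CombinePerKey(sum)
               | 'P1' >> beam.Map(print))

        # non-1:1 - the generator may yield nothing for some inputs
        words | 'Filter' >> beam.FlatMap(matches, 'python') | 'P2' >> beam.Map(print)

        # aggregate the whole collection
        words | 'Len' >> beam.Map(len) | 'Total' >> beam.CombineGlobally(sum) | 'P3' >> beam.Map(print)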
In Java
"Parallel Do" ParDo acts on one item at a time - like a Map operation
Multiple instances of the class on many machines
Should NOT contain state
Good for
Filtering (choosing which inputs to emit)
Converting one Java type to another
Extracting parts of an input (e.g. fields of TableRow)
Calculating values from different parts of input
Side inputs - for when a ParDo needs more than one input collection - create a view of the extra collection, e.g. czmap = p.apply("toview", View.asMap()); then ... .apply("..", ParDo.withSideInputs(czmap))
GroupBy and Combine
Create a key-value pair KV class (in parallel on each machine with ParDo), then do an apply(GroupByKey.<String, type>create()) to create a PCollection<KV<String, Iterable<type>>>
Combine - to aggregate - Sum, Mean, etc
Combine.globally(new Sum.SumDoubleFn()) to aggregate/sum all the PCollection<Double> items
Combine.<String, Double, Double>perKey(new Sum.SumDoubleFn()) to aggregate/sum per key
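The notes above use the Java API; for comparison, a minimal Python sketch of the same ParDo + side-input pattern, with beam.pvalue.AsDict playing the role of View.asMap (the city/zip data is made up):
    import apache_beam as beam

    class EnrichWithZip(beam.DoFn):
        # stateless DoFn - many instances run in parallel, one element at a time
        def process(self, city, zip_by_city):
            yield (city, zip_by_city.get(city, 'unknown'))

    with beam.Pipeline() as p:
        cities = p | 'Cities' >> beam.Create(['london', 'paris'])
        zips = p | 'Zips' >> beam.Create([('london', 'EC1'), ('paris', '75001')])

        # side input: the (city, zip) pairs are materialised as a dict available to every worker
        (cities | 'Enrich' >> beam.ParDo(EnrichWithZip(), beam.pvalue.AsDict(zips))
                | 'Print' >> beam.Map(print))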
Exactly-Once - using your own message identifier - as opposed to the internal message id
Specify a unique id attribute when publishing to Pub/Sub: msg.publish(data, myid="1234") or p.apply(PubsubIO.writeStrings().to(topic).withIdAttribute("myid"))
Tell Dataflow which attribute to deduplicate on: p.apply(PubsubIO.readStrings().fromTopic(topic).withIdAttribute("myid"))
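A sketch of the same idea in Python (project, topic and attribute names are placeholders); ReadFromPubSub's id_label is the Python counterpart of withIdAttribute and is honoured by the Dataflow runner:
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from google.cloud import pubsub_v1

    # publish side: carry your own id as a message attribute
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path('my-project', 'my-topic')  # placeholder names
    publisher.publish(topic_path, b'payload', myid='1234')

    # read side: tell the runner which attribute to deduplicate on
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=options) as p:
        (p | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic',
                                              id_label='myid')
           | 'Print' >> beam.Map(print))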
Constructs & questions for out-of-order data processing pipelines
What results are calculated? Transformations
Where in event time are results calculated? Event-time windowing
When in processing time are results materialised? Watermarks, triggers and allowed lateness
How do refinements of results relate? Accumulation modes
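A Python sketch touching all four questions (durations and data are arbitrary, and it assumes a Beam version where WindowInto accepts allowed_lateness): the combine answers "what", FixedWindows "where", the trigger plus allowed lateness "when", and the accumulation mode "how refinements relate":
    import apache_beam as beam
    from apache_beam.transforms import window, trigger

    windowing = beam.WindowInto(
        window.FixedWindows(2 * 60),                    # where: 2-minute event-time windows
        trigger=trigger.AfterWatermark(                 # when: fire at the watermark...
            early=trigger.AfterProcessingTime(30),      # ...with early speculative firings
            late=trigger.AfterCount(1)),                # ...and a firing per late element
        allowed_lateness=10 * 60,                       # accept data up to 10 minutes late
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING)  # how: refine, don't discard

    with beam.Pipeline() as p:
        (p | 'Create' >> beam.Create([('key', 1)])      # placeholder input
           | 'Window' >> windowing
           | 'Sum' >> beam.CombinePerKey(sum)           # what: the aggregation itself
           | 'Print' >> beam.Map(print))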
Concepts
Time
Event Time - related to the source of the event
Processing Time - related to the engine processing the event message
System Lag Dataflow Metric - the current maximum time, in seconds, that an item of data has been awaiting processing since it "arrived" at the input of the transformation step
Elements Added Dataflow Metric (under output collections) - how many data elements exited this step
Elements Added Dataflow Metric (for the Read PubSub Msg step of the pipeline) - the number of Pub/Sub messages read from the topic by the Pub/Sub IO connector
Watermark - a heuristic of completeness computed from the processing stream; tracks the skew - how far behind the processing engine is from the ideal of instantly processing each message
Data watermark age - the age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline
GCP Metrics doco
Streaming Data - balancing tradeoffs
Windowing model - supports unaligned event time windows: fixed, sliding, sessions
Triggering model - associates the output times of results to a runtime characteristic of the pipeline
Incremental processing model - integrates retractions and updates into the windowing and triggering models
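A short Python sketch of the session windows mentioned above (gap duration and the click data are made up); events for the same key that fall within the gap are merged into one session:
    import apache_beam as beam
    from apache_beam.transforms import window

    clicks = [('alice', 0), ('alice', 100), ('alice', 2000), ('bob', 50)]  # (user, unix_timestamp)

    with beam.Pipeline() as p:
        (p | 'Create' >> beam.Create(clicks)
           | 'Stamp' >> beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
           | 'Sessions' >> beam.WindowInto(window.Sessions(10 * 60))  # 10-minute gap
           | 'CountPerUser' >> beam.CombinePerKey(sum)
           | 'Print' >> beam.Map(print))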