
Enterprises relentlessly try to turn their ever-increasing streams of data to their advantage. IDC predicts that the global datasphere will grow from 33 ZB in 2018 (1 zettabyte = 1 million petabytes) to 175 ZB by 2025. IT departments are expected to build faster, more resilient data pipelines that are fault-tolerant and can autonomously manage tasks, such as scaling up the infrastructure, to provide just enough processing power to meet business SLAs consistently. They are also expected to allocate storage dynamically to meet growing data needs while keeping the overall budget in check.

Evolution of Global Data

Hence, data engineers can concentrate on writing the code for the pipeline, given that their primary goal is to find ways to leverage their enterprise's data assets to improve decision-making, open new growth opportunities, contribute to customer acquisition and retention strategies, and better serve ever-changing data requirements.

This blog provides an overview of the agile data processing options available to enterprises on the Google Cloud Platform (GCP) for data analytics. It covers data ingestion at scale, building reliable data pipelines, architecting modern data warehouses or data lakes, and providing platforms for analytics. While the scope of this post is not to cover the entire list of options, it touches upon the most relevant services to help the reader develop strategies for migrating or building a streaming analytics pipeline, from ingestion to consumption.

Data Processing on Google Cloud Platform

As mentioned above, the data processing options available on GCP are part of its broader data analytics offering. Let us take a look at each high-level component of the pipeline.

Agile Data Options in GCP

The first step is to get data into the platform from various disparate sources, a step generally referred to as data ingestion. Once in the pipeline, data is enriched and transformed for analytical purposes. The processed data is typically stored in a modern data warehouse or data lake, where it is finally consumed by enterprise users for operational and analytical reporting, machine learning, and advanced analytics or AI use cases. The whole pipeline can be automated using an orchestrator like Cloud Composer.

In a serverless (fully managed) data analytics platform, unlike a traditional one, infrastructure and platform concerns, such as monitoring, tuning, utilization, resource provisioning, scaling, and reliability, are moved away from the application developer and into the hands of the platform provider. Let's take a deeper dive into each of the stages and the corresponding relevant services.

Data Ingestion

Cloud Pub/Sub: An event-driven data ingestion and data movement service that follows a publish/subscribe pattern for reliability. Scalable to 100 GB/sec with consistent delivery, it can satisfy the scale of almost any enterprise. Messages can be retained for several days, with retention set to 7 days by default. The service is deeply integrated with the other components of the GCP analytics platform.
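As a sketch of what ingestion looks like in code, the snippet below publishes JSON events through an injected Pub/Sub publisher client. The `publish_events` helper is our own wrapper (not part of the client library), and the project, topic, and event fields in the commented wiring are hypothetical.

```python
import json

def publish_events(publisher, topic_path, events):
    """Publish a batch of JSON events; return the server-assigned message IDs."""
    futures = [
        publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
        for event in events
    ]
    # publish() is asynchronous; result() blocks until the broker acknowledges.
    return [future.result() for future in futures]

# Typical wiring (requires google-cloud-pubsub and GCP credentials):
# from google.cloud import pubsub_v1
# publisher = pubsub_v1.PublisherClient()
# topic_path = publisher.topic_path("my-project", "clickstream")  # hypothetical names
# publish_events(publisher, topic_path, [{"user": "u1", "action": "view"}])
```

Injecting the client keeps the helper testable with a stub and makes credentials a concern of the caller rather than the ingestion code.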

Data Processing

Cloud Dataflow: A service for processing streaming data in real time, as well as batch data. The traditional approach to building data pipelines was to maintain separate codebases (Lambda architecture patterns) for batch, micro-batch, and stream processing. Cloud Dataflow provides a unified programming model so that users can process all of these workloads with the same codebase, which also simplifies operations and management.

Apache Beam: An open-source unified model with a set of SDKs for defining and executing data pipelines. It gives users the flexibility to develop their code in languages like Java, Scala, and Python. Dataflow itself is built on the Apache Beam SDKs, and building your codebase in Beam allows developers to run or port the code to other processing engines like Spark and Flink in addition to Dataflow.

Dataproc: A fully managed Apache Hadoop and Spark service that lets users run familiar open-source tools like Spark, Hadoop, Hive, Tez, Presto, and Jupyter, tightly integrated with the rest of the GCP ecosystem. It gives the flexibility to rapidly define clusters and to choose the machine types used for master and worker nodes.

There are two types of Dataproc clusters that can be provisioned on GCP. The first is the ephemeral cluster: it is created when a job is submitted, scaled up or down as the job requires, and deleted once the job completes. The second is the long-standing cluster, where the user creates a cluster (comparable to an on-premises cluster) with defined minimum and maximum numbers of nodes. Jobs execute within those constraints, and when they complete, the cluster scales down to the minimum. Depending on the use case and the processing power needed, this gives the flexibility to choose the right type of cluster.
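The long-standing pattern can be sketched as a cluster spec with a fixed master, a minimum set of primary workers, and an elastic band of secondary workers up to the maximum. The `cluster_spec` helper and its field values are illustrative; the dict shape mirrors what the `google-cloud-dataproc` Python client's `create_cluster` call accepts, but verify against the client documentation before relying on it.

```python
def cluster_spec(name, min_workers, max_workers, machine_type="n1-standard-4"):
    """Build a Dataproc cluster config: one master, min_workers primary nodes,
    and secondary (preemptible) workers as elastic headroom up to max_workers."""
    if min_workers < 2 or max_workers < min_workers:
        raise ValueError("need at least 2 primary workers and max >= min")
    return {
        "cluster_name": name,
        "config": {
            "master_config": {"num_instances": 1,
                              "machine_type_uri": machine_type},
            "worker_config": {"num_instances": min_workers,
                              "machine_type_uri": machine_type},
            "secondary_worker_config": {
                "num_instances": max_workers - min_workers,
                "is_preemptible": True,
            },
        },
    }

# Submission sketch (requires google-cloud-dataproc and credentials;
# project, region, and endpoint are hypothetical):
# from google.cloud import dataproc_v1
# client = dataproc_v1.ClusterControllerClient(
#     client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"})
# client.create_cluster(request={"project_id": "my-project",
#                                "region": "us-central1",
#                                "cluster": cluster_spec("etl", 2, 10)})
```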

Dataproc is an enterprise-ready service with high availability and scalability. It allows both horizontal scaling (to the tune of thousands of nodes per cluster) and vertical scaling (configurable compute machine types, GPUs, solid-state drives, and persistent disks).

Data Warehouse

BigQuery: A modern, ANSI SQL compliant data warehouse offered as part of the Google Cloud Platform. It is a fully managed, serverless, petabyte-scale data warehouse in which data is securely encrypted and durable. BigQuery natively supports real-time streaming ingestion as well as machine learning through BigQuery ML.
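A typical interaction with BigQuery is plain SQL through the Python client. The `top_actions` helper below is our own wrapper and the table and column names are hypothetical; the `client.query(...).result()` sequence is the standard google-cloud-bigquery pattern of starting a job and waiting for its rows.

```python
def top_actions(client, table, limit=5):
    """Return the most frequent actions in an events table, via standard SQL."""
    query = f"""
        SELECT action, COUNT(*) AS n
        FROM `{table}`
        GROUP BY action
        ORDER BY n DESC
        LIMIT {int(limit)}
    """
    # query() starts the job; result() waits for completion and yields rows.
    return [(row.action, row.n) for row in client.query(query).result()]

# Typical wiring (requires google-cloud-bigquery and credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# top_actions(client, "my-project.analytics.events")  # hypothetical table
```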

ETL Workflows

Cloud Data Fusion: A fully managed enterprise data integration service for building and managing data pipelines. Developers, analysts, and data scientists can use it to visually create, test, debug, and deploy data pipelines. It also natively supports everyday data engineering tasks like data cleansing, matching, de-duplication, blending, and transformation, and it helps run and operationalize data pipelines at scale on GCP.

Workflow Orchestration

Cloud Composer: A fully managed workflow orchestration service. It is built on the Apache Airflow open source project and enables users to author, schedule, and monitor end-to-end data pipelines. Composer provides a graphical representation of each workflow, which helps with its management. Pipelines are configured as directed acyclic graphs (DAGs) in Python. Cloud Composer integrates natively with all GCP data and analytics services and can also connect workflows through a single orchestration tool regardless of where they reside, on-premises or in the cloud.
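A Composer pipeline is just an Airflow DAG file. Below is a minimal three-step sketch (ingest, transform, load) using Airflow 2.x `BashOperator` placeholders; the DAG id, schedule, and commands are illustrative. As a declarative pipeline definition, it is meant to be dropped into the Composer environment's DAGs folder rather than executed directly.

```python
# dags/events_pipeline.py -- a minimal Composer (Airflow 2.x) DAG sketch.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="events_pipeline",          # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo ingest")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The >> operator declares the edges of the directed acyclic graph.
    ingest >> transform >> load
```

In practice the `BashOperator` placeholders would be replaced with GCP operators (for example, Dataflow or BigQuery operators from the Google provider package) so that Composer drives the services described above.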

Reference Architectures

A typical architecture for implementing a batch or streaming pipeline would include, but is not limited to, the following components.

  • An orchestrator - It can leverage Google Cloud Composer to author, schedule, and monitor the end-to-end pipeline.
  • Data ingestion mechanism - It can leverage Google Cloud Dataflow for ingesting streaming data from real-time message queues or batch datasets from sources like databases and file storage. Creating a unified codebase with Apache Beam's development model lets us reuse the code across various pipelines.
  • Dataproc vs. Dataflow - If the environment has dependencies on tools in the Hadoop/Spark ecosystem, Dataproc might be the better choice for processing, querying, streaming, and machine learning needs. For greenfield implementations, however, Cloud Dataflow can satisfy most processing requirements.
  • Analytical storage and processing - Google BigQuery is an ideal choice for storing the staging and final datasets.
  • ETL transformations - These can be performed using Cloud Dataflow with embedded SQL.
  • Data consumption - Interactive dashboards and other data/analytical consumption needs can be addressed natively using Google Looker or Data Studio, or with well-integrated third-party tools like Tableau.

A Data Pipeline using Cloud Dataflow

All these components are fully managed services on GCP, and one can design and implement pipelines swiftly by using them. Below is an example of a data pipeline using Cloud Dataflow.

Another variant, as explained above, applies to environments with dependencies on Hadoop/Spark tools: there, it is recommended to build the data pipelines using Dataproc.

A Data Pipeline using Cloud Dataflow (Dependencies on Hadoop/Spark tools)

Picking the right options for data processing and analytical requirements depends on a variety of factors, including talent, cost, time to market, processing volumes, and future product capabilities. Below are a few general criteria for making decisions around real-time data processing needs.

  1. Batch or streaming: If an enterprise has, or expects to have, use cases that require data for decision-making in real time, it makes sense to invest in creating a unified codebase that supports both batch and streaming. Products like Apache Beam and Dataflow help in this regard: they provide the flexibility of choosing a programming language familiar to the engineers as well as the ability to port the code to other platforms like Spark or Flink in the future.
  2. Fully serverless or configurable autoscaling: Maintenance and operations are a huge part of any IT organization's cost. While cloud platforms can take on most of this work, IT organizations need to establish the right governance to track and monitor the utilization of cloud resources and keep their ROI in check. For example, choosing between ephemeral and long-standing clusters requires a deeper analysis of each workload as well as the workforce available within the enterprise. It is crucial to pick the services that warrant a serverless environment and to define limits for services that can eat into the operational budget. The cost calculator on Google Cloud is a good starting point for understanding this.
  3. Migrate or greenfield: Most organizations started their big data journey a few years ago. If the need is to migrate data pipelines to the cloud without changing the tech stack, it is often easier to use services like Dataproc, which let users keep the traditional open-source technology stack (Hadoop, Spark, etc.) with far less operational overhead; migrating traditional Hadoop/Spark pipelines to Dataproc can follow a lift-and-shift strategy. If the objective is to establish new pipelines (batch as well as streaming), it makes sense to evaluate serverless or configurable services using Dataflow.