Who is the Client

A US-based Fortune 500 department store chain with over $20B in annual sales and more than 1,000 stores.

The Challenge

The client runs a comprehensive e-commerce portal selling millions of products, which generates vast amounts of data: transactional data, behavioral data, and more. This valuable data is captured and stored across various platforms and tools and is later used to derive actionable business insights.

However, the client struggled with data issues stemming from format mismatches, missing data components, and similar problems that resulted in poor data quality. As a result, the client could not use the massive volume of data spread across enterprise systems to drive critical business decisions. They wanted to develop and implement a robust data quality framework to improve the quality of captured data and make it usable for business analysis.

The Solution

GSPANN built a roadmap to implement a data quality framework, breaking the work into quarterly milestones:

  • Defined an audit framework and the data project roadmap, participated in technology selection, and set up the environment.
  • Connected the audit framework with Apache Griffin.
  • Defined KPI thresholds to build an alert-and-trigger mechanism for the common desktop environment and SLA-based automation.
  • Defined a data quality scoring system covering business analysis, system analysis, source systems, and the enterprise data warehouse, and re-architected the data quality business rules.
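The case study does not spell out how the scoring system works. As a minimal sketch, a dimension-weighted data quality score might look like the following; the dimension names, weights, and alert threshold are illustrative assumptions, not the client's actual rules:

```python
# Hypothetical sketch of a dimension-weighted data quality score.
# Dimension names, weights, and the threshold are illustrative
# assumptions, not the client's actual scoring rules.

DIMENSION_WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "validity": 0.25,
    "accuracy": 0.20,
}

def dq_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores (each in 0..1)."""
    return sum(
        DIMENSION_WEIGHTS[dim] * dimension_scores.get(dim, 0.0)
        for dim in DIMENSION_WEIGHTS
    )

def breaches_threshold(score: float, threshold: float = 0.9) -> bool:
    """True when the overall score falls below the alerting threshold."""
    return score < threshold

scores = {"completeness": 0.98, "consistency": 0.95,
          "validity": 0.90, "accuracy": 0.85}
overall = dq_score(scores)
```

A score like this gives the audit framework a single number to compare against a KPI threshold when deciding whether to raise an alert.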

In the new process flow, the user defines the measures and the flow of filters in the data analytics/data quality (DA/DQ) framework, after which the job scheduler sends events to execute the job flow. The DA/DQ framework then makes a REST API call to the job scheduler to retrieve execution information for the completed job run. Finally, the data is stored in a MySQL database.
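As a rough sketch of that last step, the framework would fetch the scheduler's execution record and persist it as a row in MySQL. The JSON field names, table layout, and endpoint mentioned in the comments are assumptions for illustration:

```python
# Hypothetical sketch: map a scheduler execution-info document to a row
# for the MySQL store. The JSON fields and SQL statement are assumptions.
import json

INSERT_SQL = (
    "INSERT INTO job_runs (job_id, status, started_at, finished_at) "
    "VALUES (%s, %s, %s, %s)"
)

def execution_row(payload: str) -> tuple:
    """Parse a scheduler execution-info JSON document into a row tuple."""
    info = json.loads(payload)
    return (info["jobId"], info["status"],
            info["startTime"], info["endTime"])

# In production the payload would come from a REST call such as
# GET <scheduler>/executions/<run-id>; it is inlined here for illustration.
sample = ('{"jobId": "dq-daily-01", "status": "SUCCEEDED", '
          '"startTime": 1640995200, "endTime": 1640998800}')
row = execution_row(sample)
```

The resulting tuple would then be bound to `INSERT_SQL` through a MySQL client library.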

  • The DA/DQ framework finds and triggers a Griffin job linked to the current flow. Griffin sends an HTTP request to Apache Livy that starts the Spark App using measure.jar and measure.json from the request.
  • Spark executes measure.jar, reads data from the source, writes metrics into the sinks, and sends a callback to the DA/DQ framework with YARN's application ID.
  • The DA/DQ framework gets a callback from measure.jar, reads metrics from Elasticsearch, and sends notifications if any rules or measures fail.
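The Griffin-to-Livy hand-off described above can be sketched as an HTTP POST to Livy's `/batches` endpoint. The payload shape follows Livy's batch REST API; the host, jar and config paths, and Spark configuration below are assumptions:

```python
# Hypothetical sketch of the request Griffin sends to Apache Livy to
# launch the Spark measure job. Host and file paths are assumptions;
# the /batches endpoint and payload fields follow Livy's batch REST API.
import json

LIVY_URL = "http://livy-host:8998/batches"  # assumed host/port

def livy_batch_payload(measure_jar: str, env_json: str, dq_json: str) -> dict:
    """Build the Livy batch-submission body for a Griffin measure run."""
    return {
        "file": measure_jar,            # path to measure.jar
        "className": "org.apache.griffin.measure.Application",
        "args": [env_json, dq_json],    # env config + measure definition
        "conf": {"spark.yarn.maxAppAttempts": "1"},
    }

payload = livy_batch_payload(
    "hdfs:///griffin/measure.jar",
    "hdfs:///griffin/env.json",
    "hdfs:///griffin/measure.json",
)
body = json.dumps(payload)
# An HTTP client would POST `body` to LIVY_URL with
# Content-Type: application/json; Livy replies with a batch ID, and the
# YARN application ID later reaches the DA/DQ framework via the callback.
```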

Business Impact

  • The implemented data quality framework helped the client identify errors during data ingestion and rectify them during audits, which in turn improved data quality. The client’s technology team can now correct millions of data rows daily, and the data is measurably better in terms of completeness, consistency, validity, and accuracy.
  • This new data governance model drastically reduced the risk of poor business decisions based on low-quality data. It also saved the time and resources previously wasted pursuing strategies built on incorrect data.

Technologies Used

Dataproc. A fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters
BigQuery. A fully-managed, serverless data warehouse that enables scalable, cost-effective, and fast analysis over petabytes of data
Apache Spark. An open-source distributed general-purpose cluster-computing framework
Apache Airflow. An open-source workflow management platform
Azkaban. A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Apache Griffin. An open-source data quality solution for big data that supports both batch and streaming modes
Apache Livy. A service that enables easy interaction with a Spark cluster over a REST interface
Hadoop Distributed File System. A distributed file system designed to run on commodity hardware
Apache Hive. A data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

Related Capabilities

Utilize Actionable Insights from Multiple Data Hubs to Gain More Customers and Boost Sales

Unlock the power of data insights buried deep within your diverse systems across the organization. We empower businesses to effectively collect, visualize, analyze, and interpret data to support organizational goals. Our team ensures strong returns on big data technology investments through effective use of the latest data and analytics tools.
