Who is the Client

A US-based global Fortune 500 department store chain with over $20B in annual sales and more than 1,000 stores.

The Challenge

The client sells millions of lifestyle products to customers through its e-commerce website. Sales-related website data, such as product name and payment method, is stored in a MySQL database, from where the client’s analytics team accesses it to measure and analyze sales performance and understand customer preferences. These insights help the client plan new business initiatives, such as discounts and promotions, to drive sales.

However, the challenge was that reports were generated only once a day by an internal Hadoop-based system that fetched data in batches rather than in real time. This created gaps between the data reflected in MySQL (the source of truth) and the on-premises reporting system. The lack of real-time reporting and the resulting data mismatches slowed decision-making and hampered the client’s growth.

The Solution

GSPANN’s advanced analytics team examined the problem, migrated the reporting workload from the on-premises system to Google Cloud Platform (GCP), and built a deployment pipeline. We also developed scripts and tables to produce the data reports in the format desired by the leadership team. As a result of our solution, well-segmented sales data is now available in real time, in the expected formats.

The key tasks undertaken:

  • Applied a Change Data Capture (CDC) approach to stream data from the MySQL BinLog into Apache Kafka, implementing the logic needed to route each operation correctly.
  • Developed a statistics report, available on the online portal, that provides real-time information on order frequency and product sales.
  • Removed errors in the existing logic and business transformations with a new approach to data cleansing, denormalization of JSON data, and related steps (a minimal sketch of the denormalization step follows this list).
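To illustrate the kind of transformation involved, the sketch below flattens a nested JSON order event into one row per line item, with light cleansing along the way. It is a minimal Python example; the event structure and field names (order_id, payment, items, and so on) are assumptions for illustration, not the client’s actual schema.

```python
import json

# Hypothetical nested order event from the CDC stream; field names are
# illustrative only, not the client's actual schema.
SAMPLE_EVENT = json.dumps({
    "order_id": 1001,
    "payment": {"method": "credit_card", "amount": 59.99},
    "items": [
        {"sku": "A-1", "name": " Desk Lamp ", "qty": 1, "price": 39.99},
        {"sku": "B-2", "name": "Notebook", "qty": 2, "price": 10.00},
    ],
})


def denormalize_order(event_json: str) -> list[dict]:
    """Flatten one nested order event into one flat row per line item."""
    event = json.loads(event_json)
    rows = []
    for item in event.get("items", []):
        rows.append({
            "order_id": event["order_id"],
            "payment_method": event.get("payment", {}).get("method"),
            "sku": item["sku"],
            "product_name": item["name"].strip(),  # basic cleansing
            "quantity": int(item["qty"]),
            "line_total": round(item["qty"] * item["price"], 2),
        })
    return rows


if __name__ == "__main__":
    for row in denormalize_order(SAMPLE_EVENT):
        print(row)
```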

To elaborate, we configured Maxwell Daemon to ingest data in near real-time and stream it from the MySQL BinLog to the Kafka broker, from where it is stored in BigQuery stage dataset tables. The MySQL BinLog records all operations performed on the MySQL database, and Maxwell Daemon captures information about these operations from the BinLog into Maxwell database tables. Based on the information in the Maxwell tables, Kafka streaming jobs pull the data from the BinLog, segregated by operation since each operation targets different tables.
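A minimal sketch of how such a streaming consumer might look, assuming the kafka-python and google-cloud-bigquery client libraries. The broker address, topic name, BigQuery table ID, and filtering rules are illustrative assumptions, not the client’s actual configuration; Maxwell publishes each BinLog operation as a JSON message that carries the source table, the operation type, and the changed row.

```python
import json

from kafka import KafkaConsumer          # kafka-python
from google.cloud import bigquery        # google-cloud-bigquery

# Illustrative names only: broker, topic, and stage table are assumptions.
KAFKA_TOPIC = "maxwell"
BOOTSTRAP_SERVERS = ["localhost:9092"]
STAGE_TABLE = "my-project.stage_dataset.orders_stage"

consumer = KafkaConsumer(
    KAFKA_TOPIC,
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)
bq_client = bigquery.Client()

for message in consumer:
    event = message.value
    # Each Maxwell message wraps one BinLog operation with metadata such as
    # the source table and operation type; route only order changes here.
    if event.get("table") == "orders" and event.get("type") in ("insert", "update"):
        row = event["data"]  # the changed row as a flat JSON object
        errors = bq_client.insert_rows_json(STAGE_TABLE, [row])
        if errors:
            print(f"BigQuery insert errors: {errors}")
```

In practice, one such job (or topic partition) per source table keeps the operations segregated, as described above.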

Business Impact

  • Accurate hourly reports are generated with no mismatches against the MySQL source.
  • Access to real-time formatted data enabled the leadership to make quick and impactful business decisions.
  • Timely business decisions resulted in a massive increase in online sales.

Technologies Used

Python. An interpreted, high-level, and general-purpose programming language that enables programmers to write clear and logical code
Hive. A data warehouse software project built on top of Apache Hadoop to provide data query and analysis
BigQuery. A fully managed, serverless data warehouse that enables scalable, cost-effective, and fast analysis over petabytes of data
Apache Pig. A high-level platform for creating programs that run on Apache Hadoop
Spark Streaming. An extension of the core Spark API (Application Programming Interface) that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
Apache Kafka. A unified, high-throughput, low-latency platform for handling real-time data feeds

Related Capabilities

Utilize Actionable Insights from Multiple Data Hubs to Gain More Customers and Boost Sales

Unlock the power of the data insights buried deep within diverse systems across your organization. We empower businesses to effectively collect, beautifully visualize, critically analyze, and intelligently interpret data to support organizational goals. Our team ensures good returns on your big data technology investments through the effective use of the latest data and analytics tools.
