AWS Data Lake & Real-Time Data Processing for an Indian Retail Conglomerate

Overview

The AKIRA team was chartered to assess the data roadmap, infrastructure setup, and business needs for the Data Management, Reporting, and BI platform of an Indian retail conglomerate. The current data team is structured more tactically than strategically. The current architecture supports data from multiple source systems, with reporting service modules built on Superset and Excel.

Overall Systems Architecture

  • The client supports external and internal reporting needs using Superset, extracts, and Excel.
  • Their current BI architecture is built on a mix of technologies, including Debezium, NiFi, Sqoop, Hive, Kafka, and Presto.
  • Over time this strategy has produced a host of challenges in scalability, usability, and quality standards, due to constraints of the older design being carried forward as well as gaps in functionality.
  • The diagrams below depict the current data platform architecture.

Flat files from the various source systems – CRT, GP, and other utility applications – will be uploaded to AWS S3 based on each source's data cadence.

Views or tables are utilized by BI reports, ad hoc user queries, and outbound extracts for vendors.

  • Data from batch and NRT sources is pushed directly to Manthan via the Informatica ETL/ingestion process.

  • The solution architecture involves establishing a vision for data as an enterprise asset and a set of guiding principles.

  • Dataset refresh and latency are not well managed across pipelines and jobs.

  • Architect for the long term: focus not on meeting one or two use cases, but on enabling use cases through a platform that is agile, scalable, reliable, and able to flex with business needs.

  • Daily processing takes more than six hours before data becomes available for reporting.

  • An effective operational governance model.

  • NRT data sync breaks frequently, which in turn affects operational reporting and actions.

  • An agile data model and architecture that will enable the necessary data capabilities.

  • Data redundancy and multiple data touch points incur significant infrastructure cost, in both storage and processing.

  • Effective data quality, master data management, data lineage, and metadata management for a deeper understanding of the data.

  • Analytical data insights that enable new business opportunities and processes.

Proposed Architecture in AWS

Their POS data is fed into the system via Kafka, while files are copied to an FTP site.
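As a rough illustration of this feed, the sketch below consumes POS transaction events from Kafka using the kafka-python client; the topic name, broker address, and consumer group are hypothetical placeholders.

```python
# Minimal sketch: consume POS events from Kafka for downstream landing.
# Topic, broker, and group names are assumptions, not the client's values.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pos-transactions",                  # hypothetical topic name
    bootstrap_servers=["broker1:9092"],  # placeholder broker address
    group_id="pos-ingestion",            # placeholder consumer group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Downstream, each event would be landed in S3 / the pipeline.
    print(event)
```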

1. Data sources – data from multiple MySQL databases and FTP (XML format).

2. Time-based rules trigger the ingestion process.
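One way to express such a rule is an EventBridge schedule wired to the ingestion Lambda, as in the sketch below; the rule name, cron expression, and function ARN are placeholders, and the Lambda would also need a resource-based permission allowing EventBridge to invoke it.

```python
# Hedged sketch: a time-based EventBridge rule that triggers ingestion.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-ingestion-trigger",          # hypothetical rule name
    ScheduleExpression="cron(0 1 * * ? *)",  # e.g. 01:00 UTC daily
    State="ENABLED",
)

events.put_targets(
    Rule="daily-ingestion-trigger",
    Targets=[{
        "Id": "ingest-lambda",
        # Placeholder ARN for the ingestion function.
        "Arn": "arn:aws:lambda:ap-south-1:123456789012:function:ingest",
    }],
)
```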

3. Lambda functions written in Python fetch the incremental data (see the sketch after step 4).

4. Landing zone – S3; all data is pushed in its original format.
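The sketch below combines steps 3 and 4: a Lambda-style handler pulls incremental rows from MySQL and lands them unmodified in the S3 landing zone. The table, watermark column, and bucket names are assumptions.

```python
# Sketch for steps 3-4: incremental MySQL fetch, raw landing in S3.
import json
import boto3
import pymysql  # assumes the driver is packaged with the Lambda

s3 = boto3.client("s3")

def handler(event, context):
    # In practice credentials would come from Secrets Manager (step 11).
    conn = pymysql.connect(host="mysql-host", user="etl_user",
                           password="***", database="sales")
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # Incremental pull keyed on a last-updated watermark (placeholder).
        cur.execute("SELECT * FROM orders WHERE updated_at > %s",
                    (event["last_watermark"],))
        rows = cur.fetchall()

    # Step 4: push to the landing zone in its original (raw) form.
    s3.put_object(
        Bucket="retail-landing-zone",  # hypothetical bucket
        Key=f"mysql/orders/{context.aws_request_id}.json",
        Body=json.dumps(rows, default=str),
    )
    return {"rows_fetched": len(rows)}
```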

5. Transformation & Validation using Lambda and Python.
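One possible shape for this step, assuming XML order files, an S3-event-shaped payload, and a curated output bucket (all hypothetical): parse the file, keep only records with the required fields, and write the result as CSV.

```python
# Sketch for step 5: parse, validate, and re-emit an incoming XML file.
import xml.etree.ElementTree as ET
import csv
import io
import boto3

s3 = boto3.client("s3")
REQUIRED = ("order_id", "store_id", "amount")  # assumed mandatory fields

def handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    root = ET.fromstring(body)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=REQUIRED)
    writer.writeheader()
    for record in root.iter("order"):  # hypothetical element name
        row = {f: record.findtext(f) for f in REQUIRED}
        if all(row.values()):          # drop incomplete records
            writer.writerow(row)

    s3.put_object(Bucket="retail-curated-zone",  # hypothetical bucket
                  Key=key.replace(".xml", ".csv"),
                  Body=out.getvalue())
```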

6. Notification – SNS; as soon as a new file from FTP arrives in S3, it triggers further processing.
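This wiring could look like the following one-time bucket configuration, with placeholder bucket, topic ARN, and key prefix.

```python
# Sketch for step 6: new FTP drops in S3 publish to an SNS topic,
# which fans out to the processing Lambda.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="retail-landing-zone",  # hypothetical bucket
    NotificationConfiguration={
        "TopicConfigurations": [{
            # Placeholder topic ARN.
            "TopicArn": "arn:aws:sns:ap-south-1:123456789012:new-ftp-file",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "ftp/"}  # only FTP drops
            ]}},
        }]
    },
)
```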

7. Read, transform & copy data to Manthan.

8. Central DW – Redshift; clean, ready-to-use data is stored as a normalized dataset.
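A hedged sketch of the load into Redshift, using the Redshift Data API with a COPY statement so no persistent connection is needed; the cluster, database, schema, and IAM role values are placeholders.

```python
# Sketch for step 8: COPY curated S3 data into the central Redshift DW.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="retail-dw",  # hypothetical cluster
    Database="analytics",
    DbUser="etl_user",
    Sql="""
        COPY analytics.orders
        FROM 's3://retail-curated-zone/ftp/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS CSV IGNOREHEADER 1;
    """,
)
```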

9. Consumption layer – the data can be used for various purposes.

10. Logging – historical logs are saved in S3, and various metrics are used to trigger events or alarms.
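The sketch below shows one way to emit a custom pipeline metric and alarm on it with CloudWatch; the namespace, metric name, threshold, and SNS topic are illustrative assumptions.

```python
# Sketch for step 10: custom metric plus an alarm on pipeline failures.
import boto3

cw = boto3.client("cloudwatch")

# Emit a metric after each run (e.g. rows that failed validation).
cw.put_metric_data(
    Namespace="RetailPipeline",  # hypothetical namespace
    MetricData=[{"MetricName": "FailedRecords", "Value": 12,
                 "Unit": "Count"}],
)

# Alarm when failures exceed a threshold over one 5-minute period.
cw.put_metric_alarm(
    AlarmName="pipeline-failed-records",
    Namespace="RetailPipeline",
    MetricName="FailedRecords",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    # Placeholder alerting topic.
    AlarmActions=["arn:aws:sns:ap-south-1:123456789012:pipeline-alerts"],
)
```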

11. Management layer – for security, passwords and access keys are stored in Secrets Manager; user access and AWS service access are managed via IAM, and CloudTrail and CloudWatch are configured for logging.
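As a closing illustration, credentials could be fetched from Secrets Manager at runtime rather than hard-coded; the secret name below is a placeholder, and IAM policies on the calling role govern who may read it.

```python
# Sketch for step 11: fetch database credentials from Secrets Manager.
import json
import boto3

secrets = boto3.client("secretsmanager")

resp = secrets.get_secret_value(SecretId="prod/mysql/etl_user")  # placeholder
creds = json.loads(resp["SecretString"])
# creds now holds e.g. {"host": ..., "username": ..., "password": ...}
```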