Data Engineering Project: AWS Stream and Batch Processing Pipelines for Credit Card Transactions
A few weeks ago I started building a data engineering project in the AWS Cloud. As a data warehouse/ETL professional with more than six years of experience, I wanted to move into building data warehouses in the cloud. With that in mind, and to understand the different cloud services, I built a data processing pipeline in AWS. In this hobby project, I took a credit card transaction dataset from the open-source platform Kaggle and built end-to-end stream and batch pipelines (data collection, ETL, reporting, etc.) in AWS using different AWS services. Below are the goals/use cases that I defined when I started this project.
Objectives of this project:
- Build and understand a data processing framework in AWS of the kind companies use for stream and batch data loading
- Set up and understand the cloud components involved in data streaming and batch processing (API Gateway, Kinesis, Lambda functions, S3, Redshift, RDS, QuickSight, Cloud9, etc.); a minimal sketch of the streaming hop follows this list
- Understand how to identify failures in a data processing pipeline and how to build systems that handle failures and errors gracefully
- Understand how to approach and build a data processing pipeline from the ground up in AWS
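To make the streaming path concrete, here is a minimal sketch of the Kinesis-to-S3 hop: a Lambda handler that decodes incoming records and lands them in a raw bucket, reporting bad records individually so only they are retried. The bucket name and object key layout are placeholders, and the partial-batch error shape assumes `ReportBatchItemFailures` is enabled on the event source mapping; the actual pipeline is documented in the posts linked below.

```python
import base64
import json

import boto3

# Hypothetical bucket name; the record payload mirrors the Kaggle
# transaction fields pushed through API Gateway into Kinesis.
RAW_BUCKET = "cc-transactions-raw"

s3 = boto3.client("s3")

def handler(event, context):
    """Kinesis-triggered Lambda that lands a batch of transactions in S3.

    Kinesis delivers records base64-encoded. Malformed records are
    reported back via batchItemFailures so that only they are retried
    (this assumes ReportBatchItemFailures is enabled on the event
    source mapping).
    """
    failures, good = [], []
    for record in event["Records"]:
        try:
            good.append(json.loads(base64.b64decode(record["kinesis"]["data"])))
        except (ValueError, KeyError):
            failures.append(
                {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
            )

    if good:
        # One JSON-lines object per invocation keeps the raw zone append-only.
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=f"raw/{context.aws_request_id}.jsonl",
            Body="\n".join(json.dumps(r) for r in good).encode("utf-8"),
        )
    return {"batchItemFailures": failures}
```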
Main Use Case (Transactional Database)
Work with Financial Transactions:
- Store transactions (credit card transaction details paid to merchants) and give users access to their individual transactions
- Alert on fraudulent transactions in the dashboard in real time (see the alerting sketch after this list)
- Monitor transactions in a dashboard in real time (by city, state, time, etc., covering both authentic and fraudulent transactions)
- Tune transactional database tables for better read/write performance (highly normalized)
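As an illustration of the real-time alerting use case, the sketch below publishes an SNS notification whenever a transaction carries the dataset's fraud label. The topic ARN is a placeholder, the field names (`is_fraud`, `trans_num`, `amt`, `city`, `state`) follow the Kaggle dataset's columns, and the actual project may route alerts to the dashboard differently.

```python
import json

import boto3

# Placeholder topic ARN; field names follow the Kaggle dataset's columns.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:fraud-alerts"

sns = boto3.client("sns")

def alert_if_fraud(transaction: dict) -> bool:
    """Publish an SNS alert when a transaction is labelled fraudulent.

    Returns True if an alert was sent, so the caller can also flag the
    event on the real-time dashboard.
    """
    if int(transaction.get("is_fraud", 0)) != 1:
        return False
    sns.publish(
        TopicArn=ALERT_TOPIC_ARN,
        Subject="Fraudulent transaction detected",
        Message=json.dumps(
            {k: transaction.get(k) for k in ("trans_num", "amt", "city", "state")}
        ),
    )
    return True
```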
Analytical Use Case (Data Warehouse)
- Cube models (drag and drop dimensions across fact tables in a reporting tool to view aggregated analytics), e.g. fraudulent transactions per city for a specific time frame, or overall transactions on a specific day for a specific location
- Fact and dimension tables (star schema) in the data warehouse. Fact: Transaction; Dimensions: Customer, Address, Merchant, Time. A hypothetical schema sketch follows this list.
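To make the star schema concrete, here is a minimal sketch of the fact and dimension tables as Redshift DDL submitted through the Redshift Data API, plus one cube-style query (fraudulent transactions per city for a quarter). The cluster, database, user, and column names are illustrative assumptions, not the project's exact schema.

```python
import boto3

# Cluster, database, and user names are placeholders; the column layout
# is illustrative, not the project's exact warehouse schema.
client = boto3.client("redshift-data")

DDL = [
    "CREATE TABLE IF NOT EXISTS dim_customer (customer_id BIGINT, first_name VARCHAR(64), last_name VARCHAR(64))",
    "CREATE TABLE IF NOT EXISTS dim_address (address_id BIGINT, city VARCHAR(64), state CHAR(2))",
    "CREATE TABLE IF NOT EXISTS dim_merchant (merchant_id BIGINT, merchant_name VARCHAR(128))",
    "CREATE TABLE IF NOT EXISTS dim_time (time_id BIGINT, txn_date DATE, txn_hour SMALLINT)",
    """CREATE TABLE IF NOT EXISTS fact_transaction (
           transaction_id BIGINT,
           customer_id BIGINT,
           address_id BIGINT,
           merchant_id BIGINT,
           time_id BIGINT,
           amount DECIMAL(12, 2),
           is_fraud BOOLEAN)""",
]

# Cube-style question: fraud transactions per city for a specific quarter.
FRAUD_PER_CITY = """
SELECT a.city, COUNT(*) AS fraud_count
FROM fact_transaction f
JOIN dim_address a ON a.address_id = f.address_id
JOIN dim_time t ON t.time_id = f.time_id
WHERE f.is_fraud AND t.txn_date BETWEEN '2020-01-01' AND '2020-03-31'
GROUP BY a.city
ORDER BY fraud_count DESC
"""

# The Data API is asynchronous: these calls return statement IDs, and
# results are fetched later with describe_statement / get_statement_result.
client.batch_execute_statement(
    ClusterIdentifier="cc-dwh-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=DDL,
)
client.execute_statement(
    ClusterIdentifier="cc-dwh-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=FRAUD_PER_CITY,
)
```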
Below are my Medium blog posts, where I documented every part of this project in detail.
- OLTP, OLAP Database Modelling on a Kaggle Dataset
- Building a Stream Processing Pipeline in AWS
- Bulk Import from AWS S3 Bucket into RDS Aurora Serverless using AWS Cloud9
- Building a Grafana Dashboard in AWS on Serverless Aurora RDS
- Amazon QuickSight Dashboard for S3 CSV Data Using Amazon Athena / Glue Crawler
- Amazon QuickSight Dashboard for S3 CSV Data using Redshift Spectrum / Glue Crawler
- Populating Amazon Redshift DWH from S3 & QuickSight Reporting