Title: Financial and Fraud Analysis Pipeline: ETL, PySpark, and Analytics
Objective: Build an end-to-end data pipeline for financial data (credit_card_transactions.csv).
Skills: ETL, SQL schema design, Data Cleaning, Analytics, Visualization.
Tools: Python (Pandas), PySpark, Matplotlib, Seaborn.
Topics covered:
- Data extraction
- Transformation
- Loading data
- Exploratory analytics (a PySpark sketch of the pipeline steps follows this list)
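A minimal PySpark sketch of the extract-transform-load steps above, assuming the input file named in the objective. The column names (trans_date_trans_time, amt, category) and the Parquet output path are illustrative assumptions, not the project's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fraud_etl").getOrCreate()

# Extract: read the raw CSV with a header row and inferred types.
raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("credit_card_transactions.csv"))

# Transform: drop duplicates, remove rows missing key fields, and derive
# a proper timestamp plus an hour-of-day column for later analysis.
# Column names here are assumptions about the CSV layout.
clean = (raw.dropDuplicates()
            .dropna(subset=["amt", "category"])
            .withColumn("trans_ts", F.to_timestamp("trans_date_trans_time"))
            .withColumn("trans_hour", F.hour("trans_ts")))

# Load: persist the cleaned data as Parquet for the analytics stage.
clean.write.mode("overwrite").parquet("output/transactions_clean")
```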
Challenges: Loading the processed data into separate tables according to the schema definition, defining surrogate keys, and using joins to establish the correct relationships between records (sketched below).
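A sketch of the surrogate-key and join step, continuing from the ETL sketch above. The table and column names (merchant, dim_merchant, fact_transactions) are hypothetical placeholders for the project's actual schema.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("fraud_load").getOrCreate()
clean = spark.read.parquet("output/transactions_clean")  # output of the ETL step

# Build a merchant dimension with a surrogate key (row_number over distinct merchants).
# Column name "merchant" is an assumption about the dataset.
w = Window.orderBy("merchant")
dim_merchant = (clean.select("merchant").distinct()
                     .withColumn("merchant_sk", F.row_number().over(w)))

# Join the surrogate key back onto the transactions to form the fact table.
fact_transactions = (clean.join(dim_merchant, on="merchant", how="left")
                          .drop("merchant"))

# Load each table separately, as the schema definition requires.
dim_merchant.write.mode("overwrite").parquet("output/dim_merchant")
fact_transactions.write.mode("overwrite").parquet("output/fact_transactions")
```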
Conclusion:
- The merchant categories most affected by fraudulent activity were identified.
- Late-night to early-morning hours require increased monitoring effort; a sketch of the hour-of-day aggregation behind this finding follows the list.
- Unauthorized transactions are not concentrated in high-value purchases; they are more common in lower-amount transactions (under $500).
- The states with the highest fraudulent activity were identified.
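A sketch of the exploratory aggregations behind the hour-of-day and amount findings, reusing the cleaned Parquet output from the ETL sketch. The column names (is_fraud, trans_hour, amt) are assumptions, not confirmed by the source.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fraud_eda").getOrCreate()
tx = spark.read.parquet("output/transactions_clean")

# Fraud count by hour of day, collected to Pandas for plotting.
by_hour = (tx.filter(F.col("is_fraud") == 1)
             .groupBy("trans_hour").count()
             .orderBy("trans_hour")
             .toPandas())

# Fraud rate by amount bucket (below vs. at/above $500), assuming is_fraud is 0/1.
by_amount = (tx.withColumn("bucket",
                           F.when(F.col("amt") < 500, "< $500").otherwise(">= $500"))
               .groupBy("bucket")
               .agg(F.avg("is_fraud").alias("fraud_rate"))
               .toPandas())

# Visualize the hourly pattern with Seaborn.
sns.barplot(data=by_hour, x="trans_hour", y="count", color="steelblue")
plt.xlabel("Hour of day")
plt.ylabel("Fraudulent transactions")
plt.tight_layout()
plt.show()
```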