- Sean Schallberger
- Bryan Carney
- Yadav Dhruvi
- Start Date: January 23, 2025
- Due Date: January 30, 2025 (11:59 PM)
This project focuses on building an ETL (Extract, Transform, Load) pipeline using Python, Pandas, and PostgreSQL. By extracting and transforming data, we generated CSV files, created an Entity-Relationship Diagram (ERD), defined a table schema, and uploaded data to a PostgreSQL database. This project demonstrates end-to-end ETL pipeline development and database integration.
- Extract data from Excel files.
- Transform the data into a format suitable for database insertion.
- Generate CSV files for different data entities.
- Create a PostgreSQL database and corresponding tables.
- Load data into the PostgreSQL database.
- Verify the database contents.
- Python
- SQL
- Pandas
- NumPy
- Regular Expressions
- PostgreSQL
- pgAdmin
- Jupyter Notebook
- Data extraction, transformation, and loading
- Entity-Relationship Diagram (ERD)
- Hands-on experience in designing and implementing an ETL pipeline.
- Proficiency in cleaning, transforming, and structuring raw data for database integration.
- Practical application of PostgreSQL for relational database design and querying.
- Columns:
category_id: Sequential entries (e.g.,cat1,cat2, ...).category: Unique category titles.
- Exported as
category.csv.
- Columns:
subcategory_id: Sequential entries (e.g.,subcat1,subcat2, ...).subcategory: Unique subcategory titles.
- Exported as
subcategory.csv.
- Columns:
cf_idcontact_idcompany_namedescription(formerlyblurb)goal(float)pledged(float)outcomebackers_countcountrycurrencylaunch_date(datetime)end_date(datetime)category_id(matchescategory.csv)subcategory_id(matchessubcategory.csv)
- Exported as
campaign.csv
- Columns:
contact_id(integer)first_namelast_nameemail
- Exported as
contacts.csv.
- Database Name:
crowdfunding_db - Database Creation Script:
crowdfunding_db_schema.sql- This file includes the schema and table definitions for the following tables:
categorysubcategorycampaigncontacts
- Relationships and constraints are defined with primary keys and foreign keys.
- This file includes the schema and table definitions for the following tables:
- Steps:
- Imported CSV data into corresponding tables.
- Verified data insertion with
SELECT *statements.
The following diagram illustrates the relationships between the tables in the database:
Crowdfunding_ETL/
│
├── ETL_Mini_Project_SSchallberger_BCarney_YDhruvi.ipynb # Main project notebook
├── crowdfunding_ERD.png # Database ERD diagram
├── crowdfunding_db_schema.sql # SQL schema file
├── Resources/ # Input files and raw data
│ ├── crowdfunding.xlsx # Crowdfunding data source
│ ├── contacts.xlsx # Contacts data source
│ ├── category.csv # Transformed category data
│ ├── subcategory.csv # Transformed subcategory data
│ ├── campaign.csv # Transformed campaign data
│ ├── contacts.csv # Transformed contacts data
├── Individual_Work/ # Individual contributions
│ ├── Sean Schallberger/ # Contains Sean's individual work files
│ ├── Bryan Carney/ # Contains Bryan's individual work files
│ ├── Dhruvi Yadav/ # Contains Dhruvi's individual work files
- Clone the repository:
git clone https://github.com/yourusername/Crowdfunding_ETL.git
- Navigate to the project directory:
cd Crowdfunding_ETL
- Open and run
ETL_Mini_Project_SSchallberger_BCarney_YDhruvi.ipynbin Jupyter Notebook.
-
The provided
crowdfunding_db_schema.sqlfile has the database creation commands commented out at the top:-- DROP DATABASE IF EXISTS crowdfunding_db; -- CREATE DATABASE crowdfunding_db; -- \c crowdfunding_db;
This means the database must either be created manually through pgAdmin or via the terminal before running the script. Follow these steps:
Option A: Create Database Manually in pgAdmin
- Open pgAdmin and create a new database named
crowdfunding_db. - Right-click on the database and open the Query Tool.
- Run the schema file (
crowdfunding_db_schema.sql) to create the necessary tables.
Option B: Use Terminal Commands
- Open your terminal and log into PostgreSQL:
psql -U username
- Create and connect to the database by running:
CREATE DATABASE crowdfunding_db; \c crowdfunding_db;
- Open pgAdmin and create a new database named
-
Once the database is created, execute the schema script:
psql -U username -d crowdfunding_db -f crowdfunding_db_schema.sql
-
Import CSV data into the database in the following order with the specified settings:
- Order:
category.csv,subcategory.csv,contacts.csv,campaign.csv - Settings: Delimiter =
,, Header =True
- Order:
-
Verify table contents with SQL queries. You can use one or more of the following approaches:
- Option 1: Run
SELECT * FROM table_name;for each table to view all records. - Option 2: Use
SELECT COUNT(*) FROM table_name;to confirm the number of rows imported matches expectations. - Option 3: -- Option 3: Validate data integrity by checking relationships between tables
SELECT c.category, sc.subcategory FROM category c JOIN campaign ca ON c.category_id = ca.category_id JOIN subcategory sc ON ca.subcategory_id = sc.subcategory_id;
- Option 1: Run
| Date | Milestone |
|---|---|
| Jan 23 | Project Kickoff |
| Jan 24 | Category and Subcategory DataFrames |
| Jan 25 | Campaign DataFrame |
| Jan 26 | Contacts DataFrame |
| Jan 27 | Database Schema and Integration |
| Jan 28 | Testing and Debugging |
| Jan 29 | Final Documentation and Submission |
