This project is inspired by end-to-end analytics case studies built by Ankit Bansal’s (Snowflake + AWS walkthrough) using the Yelp Open Dataset but reimagined entirely on Google Cloud Platform (GCP). The objective is to: Design and implement a simple yet extensible ELT pipeline that ingests raw review data into Cloud Storage and maps it into BigQuery. From there, I can do downstream tasks such as sentiment analysis, theme extraction (via Vertex AI), and dashboarding.
- Vast volumes of unstructured customer review text remain dormant, preventing timely business decision-making.
- Businesses struggle to move beyond simple ratings to identify root causes of dissatisfaction (e.g., product vs. service). They lack automated pipelines for proactive customer service and data-driven investment.
- This project demonstrates how raw text reviews can be transformed into business-ready insights through automation, analytics, and visualization.
- Analysis was limited to a 600k review subset (for ELT) and a 14k subset (for intensive downstream AI) due to cloud cost and resource limitations.
- ELT automation was focused solely on the Reviews data. Business is handled manually, and User data were excluded.
- Visualization was constrained to Power BI due to Looker Studio trial limitations. Automated data refresh for the Power BI has not yet been configured
- LLM throughput limits (despite parallelization) forced the use of traditional ML methods executed via a Vertex AI Workbench script, rather than the Batch Prediction service.
- Event-Driven ELT Pipeline: Automated, serverless pipeline on GCP that transforms raw JSON to a clean, partitioned BigQuery Analytics table.
- Hybrid Downstream Analytics: POC for data enrichment using both VADER sentiment (fast, lexicon-based) and TF-IDF + KMeans (unsupervised themes).
- BI Dashboard Reporting: Developed a Power BI Dashboard with a manual data flow from BigQuery, enabling immediate visualization and access to key customer insights.
- Cloud Functions
- BigQuery
- Cloud Storage
- Vertex AI Workbench
- Power BI
- Executed a manual ELT run using the Google Cloud CLI to validate core transformation logic.
- Mapped over 83k distinct business categories to 10 standard
general_categorylabels using fuzzy string matching (fuzzywuzzy), and reloaded the mapping into the staging environment.
- Developed and deployed Python Cloud Functions to create an event-driven ELT pipeline for the reviews data.
- Ran SQL queries to answer five basic questions, verifying the quality of the transformed data.
- Asked 5 questions to better understand the data in
data_joinedtable - Performed a detailed downstream task focusing on data from a single
general_category. - Defined key business questions and sketched the required Power BI dashboard layout.
- Developed the Power BI dashboard, extracted key findings, and formulated actionable recommendations.
├───cloud_function
│ cloud_function_1_UPD.py # Cloud Function 1: Handle the extract, load
│ cloud_function_2_UPD.py # Cloud Function 2: Handle the transformation
│ requirements.txt # Requirements for both Cloud Function
│
├───downstream_task
│ downstream-task-t3.ipynb # Notebook for downstream task
│
├───prepare_and_process
│ get_random_review.py # Script to get random review from splitted JSON reviews
│ split_large_json_review.py # Script to split large JSON reviews
| general_business_category_mapping.ipynb # Notebook for mapping general business category
Pipeline Components Overview
The business data required a one-time manual ELT pipeline due to its unique complexity. The original Yelp business categories, stored in the categories field, containing multiple specific entries per business (over 83,000 distinct combinations observed). This granularity made it difficult to define a high-level category necessary for downstream analysis.
To address this, I implemented a specialized data treatment:
- All distinct categories were extracted and mapped to a set of 10 standardized general category labels using the
fuzzywuzzyalgorithm for flexible string matching. - The resulting mapping table was loaded back into the
yelp_landingdataset as thebusiness_category_mapping_fwtable. - This mapping table was then transformed and loaded into the
yelp_analyticsdataset by performing aLEFT JOINwith thebusinesses_rawtable. This standardized business data is now ready to be joined with the reviews data in the final gold layer for analysis.
The reviews data is handled by a fully automated, event-driven ELT pipeline using Cloud Functions. The pipeline is triggered every time a new JSON file is uploaded to the designated input Cloud Storage bucket. It is designed to handle out-of-order data arrival and prevent duplicate records in the final analytics table. The detail is as follow:
| Component | Detail | Purpose |
|---|---|---|
| Input Bucket | yelp-elt-inbound-data |
Trigger: New JSON files are uploaded here, triggering the ELT process. |
| File Format | Newline-Delimited JSON (NDJSON) | Original Dataset Format. |
This function handles the Extract and Load phases.
| Feature | Details |
|---|---|
| Trigger | GCS Event (google.cloud.storage.object.v1.finalized) on the input bucket. |
| Action | Loads the newly uploaded JSON file directly into the staging BigQuery table. |
| Behavior | Uses WRITE_APPEND to load all data into the staging table. |
| Next Step | Securely triggers Cloud Function 2 (transform_data) via an authenticated HTTP request upon successful load. |
| Feature | Detail |
|---|---|
| Purpose | Temporary storage for raw, unsanitized data. Acts as an audit log. |
| Data Integrity | May contain duplicate review_id records due to file re-uploads, but this is handled by the next stage. |
This function handles the core business logic, transformation, and incremental loading.
| Feature | Details |
|---|---|
| Trigger | HTTP request from Cloud Function 1. |
| Action | Executes a BigQuery MERGE statement for atomic, race-condition-safe operations. |
| Deduplication Logic | Uses the MERGE ... USING ... ON target.review_id = source.review_id WHEN NOT MATCHED THEN to ensure records are only inserted once, even with concurrent executions |
| Feature | Detail |
|---|---|
| Data Integrity | Contains only unique review_id records. |
| Structure | Partitioned by date (truncated to YEAR) and Clustered by business_id for query optimization. |
| Readiness | Ready for consumption by downstream tasks. |
Before commencing downstream tasks, a preliminary data exploration was conducted using SQL queries to ensure data quality and establish a foundational understanding of the transformed dataset. The five key questions defined below provided essential context for subsequent analysis.
| Question | Business Impact |
|---|---|
| Business Count per Category | Market sizing and coverage by category |
| Popular Categories (by Reviews) | Customer engagement and demand proxy per category |
| % of 5-Star Reviews per Business | Quality signal and competitive benchmarking |
| Top 3 Most-Reviewed Businesses by State | Identifying local popularity and state-level leaders |
| Avg. Rating for High-Volume Businesses | Stable performance view, filtering out noise from low review counts |
After completing the ELT pipeline with Cloud Function and completing Data Exploration with SQL, I proceeded with downstream analytics to enrich the Yelp reviews data. The processed dataset is stored in a new BigQuery dataset: yelp_golden.
-
Task 1 & 2: Sentiment Analysis
- Applied VADER for fast sentiment scoring (positive, neutral, negative).
- Added both sentiment label and sentiment score to each review.
-
Task 3: Review Theme Extraction
-
Used TF-IDF + KMeans clustering to group reviews into themes.
-
Interpreted clusters into 6 themes relevant to Shopping & Retail:
- Product Variety & Selection
- Clothing & Apparel Experience
- Hair & Beauty Services
- Customer Service Quality
- Staff Friendliness & Atmosphere
- Mall & Food Court Experience
-
Note: These tasks were executed on Vertex AI Workbench Notebook for flexibility. BigQuery ML was considered, but initial experiments with thousands of reviews showed longer runtimes.
️ Note: This analysis is based on a sample of ~14,700 reviews from a single business category (Shopping & Retail). The results do not represent the entire Yelp dataset or real-world conditions. The purpose is to demonstrate insights that can be derived after the full pipeline (ELT + downstream processing).
-
Built a Power BI dashboard
-
Dashboard design started with business questions and sketching.
-
Final dashboard includes:
- KPIs: Count of reviews, Avg. Star Review, Avg. Star Business
- Sentiment distribution by theme
- Star rating distribution by theme (matrix heatmap)
- Filters for Sentiment
Key Findings
- Product Variety & Selection dominates in review volume and is largely positive.
- Clothing & Apparel Experience shows mixed ratings, with a notable skew toward 1-star reviews.
- Hair & Beauty Services performs well overall, with a strong concentration in 5-star reviews and minimal negative feedback. Suggesting this segment delivers reliably high customer experiences.
Recommendation
- Despite having the highest review volume, “Product Variety & Selection” also includes noticeable negative sentiment. Focus on stock availability, layout, and product curation.
- The mixed and lower review distribution here suggests potential improvement opportunities in food quality, cleanliness, or seating comfort to boost overall retail experience.
Enforce strict schema validation (e.g., data types, naming) in the Cloud Function.
- Add Urgency Tag and LLM-Generated Response for immediate customer service action.
- Deploy analysis to a scheduled Vertex AI Pipeline (Batch Prediction) to auto-process all new and existing data.
- Use embeddings model and clustering for more accurate semantic theme extraction.
- Use Power BI DirectQuery or scheduled refresh via Gateway for live data updates.
- Implement a scheduled Cloud Run service to generate and send narrative summary reports using Gen AI to the stakeholder.
- Extract Load Transform Concept
- Medallion Lakehouse Architecture
- ETL Cloud Function Tutorial
- Bigquery Batch Prediction Job Tutorial
- Vertex AI Custom Code Training Tutorial
Developed by Roissyah Fernanda. Some parts of the implementation and documentation were assisted by AI tools (e.g., for code refactoring, formatting, and technical writing support), while all project ideas, integration, and decision-making were independently developed by the author.
Your input is valued! If you spot a bug or want to suggest an improvement, please submit a pull request. Let's collaborate to make this end-to-end Data Analytics project even better!



