
This project automates exploratory data analysis (EDA) with DataPulse, enabling users to upload, clean, and visualize datasets effortlessly. It integrates machine learning models like Logistic Regression and XGBoost for insightful analysis via an intuitive Streamlit interface.

rakeshkapilavayi/DataPulse-Automated-EDA

DataPulse: Your Easy Data Analysis Tool

Hello! I'm Rakesh Kapilavayi, and I created DataPulse to help you explore and understand your data in a simple way. This tool is a web app that lets you:

  • Upload a dataset (CSV or Excel)
  • Clean it manually or automatically
  • Create visualizations
  • Run machine learning models
  • Generate professional insights and recommendations

Whether you're new to data or an expert, this app makes data analysis fun and easy!

🌐 Live Demo

Try it out here: Click Here


🚀 What Can DataPulse Do?

📂 Upload Your Data

  • Add a CSV or Excel file to start.
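
As a rough illustration of what happens after an upload, the sketch below picks a pandas reader from the file extension. The function name `load_dataset` and the sample data are made up for this example; the app's actual loading code in interface.py may differ.

```python
import io
from pathlib import Path

import pandas as pd


def load_dataset(file_obj, filename: str) -> pd.DataFrame:
    """Read an uploaded CSV or Excel file into a DataFrame, chosen by extension.

    Hypothetical helper for illustration; not taken from the DataPulse source.
    """
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(file_obj)
    if suffix == ".xlsx":
        # Reading .xlsx relies on openpyxl, which requirements.txt installs.
        return pd.read_excel(file_obj)
    raise ValueError(f"Unsupported file type: {suffix or 'none'}")


# An in-memory CSV stands in for an uploaded file.
sample = io.BytesIO(b"age,city\n31,Pune\n45,Delhi\n")
df = load_dataset(sample, "people.csv")
print(df.shape)  # (2, 2)
```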

📊 See a Summary

  • Number of rows and columns
  • Missing values
  • Duplicate rows
  • Data types and unique values
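
The summary figures listed above can all be computed with a few pandas calls. This is a minimal sketch assuming the data is already in a DataFrame; `summarize` is a hypothetical name, not a function from functions.py.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the headline numbers a summary view might show (illustrative only)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "unique_per_column": df.nunique().to_dict(),
    }


# Made-up data: one missing age and one fully duplicated row.
df = pd.DataFrame(
    {"age": [31, 45, 45, None], "city": ["Pune", "Delhi", "Delhi", "Pune"]}
)
summary = summarize(df)
print(summary["rows"], summary["duplicate_rows"])  # 4 1
```

Note that `nunique()` ignores missing values by default, so a column's unique count does not include NaN.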

🧹 Clean Your Data

Manual Cleaning

  • Choose how to handle missing values (mean, median, mode, drop rows, etc.)
  • Delete duplicates
  • Checkbox turns light blue when selected
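
To make the per-column strategies concrete, here is a small sketch of mean, median, mode, and drop-row handling in pandas. The function `fill_missing` and its strategy names are assumptions for this example and may not match the options in the app; it also assumes the column has at least one non-missing value.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame, column: str, strategy: str) -> pd.DataFrame:
    """Return a copy of df with missing values in one column handled.

    Hypothetical helper; strategy is one of "mean", "median", "mode", "drop".
    """
    out = df.copy()
    if strategy == "drop":
        # Remove rows where this column is missing.
        return out.dropna(subset=[column])
    if strategy == "mean":
        value = out[column].mean()
    elif strategy == "median":
        value = out[column].median()
    elif strategy == "mode":
        # mode() can return several values; take the first one.
        value = out[column].mode().iloc[0]
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    out[column] = out[column].fillna(value)
    return out


df = pd.DataFrame({"score": [10.0, None, 30.0, 30.0]})
print(fill_missing(df, "score", "median")["score"].tolist())  # [10.0, 30.0, 30.0, 30.0]
```

Mean and median only make sense for numeric columns; mode also works for categorical ones.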

Auto Cleaning

  • Automatically:
    • Handle missing values
    • Remove duplicates
    • Cap outliers using IQR method
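
For readers unfamiliar with IQR capping, the sketch below clips each numeric column to the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] after dropping duplicate rows. The 1.5 factor is the common convention and an assumption here; the README does not state which factor DataPulse uses, and `cap_outliers_iqr` is a hypothetical name.

```python
import pandas as pd


def cap_outliers_iqr(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Clip each numeric column to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        # Values outside the fences are replaced by the nearest fence.
        out[col] = out[col].clip(lower=q1 - factor * iqr, upper=q3 + factor * iqr)
    return out


# Made-up data: one duplicate row (12.0) and one extreme value (500.0).
df = pd.DataFrame({"income": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
capped = cap_outliers_iqr(df.drop_duplicates())
print(capped["income"].max())  # 16.0
```

Capping keeps every row, unlike dropping outliers, so the dataset size only changes through duplicate removal.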

📈 Explore Your Data (EDA)

  • Histograms for numerical columns
  • Scatter plot for the two most correlated numeric columns
  • Heatmap showing correlation matrix
  • Bar charts for categorical columns
  • Box plots to detect outliers
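
One way to pick the two most correlated numeric columns for the scatter plot is to scan the upper triangle of the absolute correlation matrix. This is a sketch assuming Pearson correlation; `most_correlated_pair` is a hypothetical helper and the app may select the pair differently.

```python
import numpy as np
import pandas as pd


def most_correlated_pair(df: pd.DataFrame) -> tuple:
    """Return the two numeric columns with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so self-correlations and mirrored pairs are ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.idxmax()


# Made-up data: height and weight rise together, shoe is unrelated.
df = pd.DataFrame(
    {
        "height": [150, 160, 170, 180, 190],
        "weight": [50, 58, 69, 80, 92],
        "shoe": [42, 38, 44, 39, 41],
    }
)
print(most_correlated_pair(df))  # ('height', 'weight')
```

Taking the absolute value means a strong negative relationship is treated the same as a strong positive one.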

🤖 Run Machine Learning

Choose Type

  • Classification (predict categories)
  • Regression (predict numbers)

Select a Model

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • XGBoost
  • SVM (Support Vector Machine)

Features:

  • Auto-handles categorical variables
  • View accuracy metrics and visual reports
  • Confusion matrix for classification
  • Feature importance rankings
  • Cross-validation scores
  • Optional hyperparameter tuning (can improve performance, but training takes longer)
    • Checkbox turns light blue when selected
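
The sketch below shows how categorical handling, cross-validation, and feature importance fit together in scikit-learn. It assumes one-hot encoding via `pd.get_dummies` and 5-fold cross-validation; the dataset and column names are invented, and the pipeline in machinelearning.py may encode, split, and score differently.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Tiny made-up dataset: one categorical feature, one numeric feature, one target.
df = pd.DataFrame(
    {
        "plan": ["free", "pro", "free", "pro", "free", "pro"] * 5,
        "usage_hours": [1, 9, 2, 8, 3, 7] * 5,
        "churned": [1, 0, 1, 0, 1, 0] * 5,
    }
)

# One-hot encode categorical columns so the model receives numeric input only.
X = pd.get_dummies(df.drop(columns="churned"))
y = df["churned"]

# Score the model on 5 folds; each score is the accuracy on one held-out fold.
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(len(scores), scores.mean())

# Feature importances become available after fitting on the full data.
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Hyperparameter tuning would wrap the same model in a search such as scikit-learn's `GridSearchCV`, which is why it takes longer: every candidate setting is cross-validated in turn.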

💡 Get Professional Insights

  • Enhanced Insights: Get comprehensive, professionally formatted analysis reports including:
    • Dataset overview and characteristics
    • Key statistical findings and patterns
    • Correlation analysis and relationships
    • Data quality assessment
    • Strategic recommendations for next steps
    • Business impact interpretation
  • Quick Summary: Get instant statistical observations
  • Raw Statistical Data: Access detailed technical metrics

The insights combine statistical analysis with professional formatting to help you:

  • Understand missing value patterns
  • Identify important correlations
  • Assess data quality
  • Get actionable recommendations for further analysis

💾 Save Your Data

  • Download the cleaned dataset as a CSV file
  • Export insights reports as text files

🛠 What You Need

  • A computer with Python 3.8 or higher
  • A web browser (Chrome, Firefox, Safari, Edge, etc.)
  • Internet connection (for generating enhanced insights)

⚙️ How to Set It Up

1. Get the Files

Download or clone the project:

git clone https://github.com/rakeshkapilavayi/DataPulse-Automated-EDA.git
cd DataPulse-Automated-EDA

2. Set Up a Virtual Environment (optional but recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Requirements

pip install -r requirements.txt

This will install:

  • Streamlit (web framework)
  • Pandas (data manipulation)
  • NumPy (numerical operations)
  • Plotly (interactive visualizations)
  • Scikit-learn (machine learning)
  • XGBoost (advanced ML models)
  • Scipy (statistical functions)
  • Openpyxl (Excel support)
  • Google Generative AI (for enhanced insights)

4. Add a Logo (optional)

Place a logo.png file inside the project folder.

Don't have a logo? Remove or comment out the st.sidebar.image() line in interface.py.


▢️ How to Use DataPulse

1. Start the App

streamlit run interface.py

It will usually open at: http://localhost:8501

2. Upload a Dataset

  • Click "Choose a CSV or Excel file".
  • You'll see a preview of the dataset.
  • Uploading a new file resets the app.

3. Use the Tabs

🔍 Summary

  • Basic stats on rows, columns, missing data, duplicates
  • Column information with data types
  • Unique value counts

🛠 Manual Cleaning

  • Fix missing data by choosing strategies per column
  • Remove duplicates manually
  • Full control over cleaning decisions

⚙️ Auto Cleaning

  • Clean your dataset automatically with one click
  • See a detailed cleaning report:
    • Which columns were handled
    • How many duplicates were removed
    • Which outliers were capped

📊 EDA (Exploratory Data Analysis)

  • Visualize your data with interactive charts:
    • Distribution histograms
    • Correlation scatter plots (optimized for large datasets)
    • Correlation heatmaps
    • Categorical distribution bar charts

🚨 Outliers

  • Detect outliers using box plots for all numerical columns
  • Visual identification of extreme values

🤖 Machine Learning

  • Choose a task: Classification or Regression
  • Select a target column whose data type suits the task
  • Pick a model from the dropdown
  • Optional: Enable hyperparameter tuning for better performance
  • Click Train Model to view:
    • Evaluation metrics
    • Confusion matrix (classification)
    • Feature importance rankings
    • Cross-validation scores

📊 Insights

  • Generate Enhanced Insights: Get a comprehensive, professionally formatted analysis report covering:
    • Dataset characteristics and overview
    • Key statistical findings and their implications
    • Important correlations and relationships
    • Data quality assessment
    • Actionable recommendations
    • Business impact interpretation
  • Generate Quick Summary: Get instant bullet-point insights
  • View Raw Statistical Data: Access detailed technical metrics in an expandable section

💾 Export

  • Download the cleaned dataset as a CSV file
  • Download insights reports as text files

📁 Project Files

Here's what's inside the project folder:

DataPulse-Automated-EDA/
│
├── interface.py           # Main Streamlit application
├── functions.py           # Data cleaning, EDA, and insights logic
├── machinelearning.py     # ML models and training pipelines
├── llm_insights.py        # Enhanced insights generation module
├── requirements.txt       # Python dependencies
├── eda_app.log            # Application log file (auto-generated)
├── logo.png               # Optional sidebar logo
└── README.md              # Project documentation

🎯 Key Features

  • ✅ Easy to Use: Clean, intuitive interface for all skill levels
  • ✅ Comprehensive EDA: Multiple visualization types with interactive charts
  • ✅ Smart Cleaning: Automated and manual data cleaning options
  • ✅ ML Ready: Train models with just a few clicks
  • ✅ Professional Insights: Get formatted, actionable analysis reports
  • ✅ Export Everything: Download cleaned data and insights
  • ✅ Performance Optimized: Handles large datasets efficiently
  • ✅ Session Management: Properly handles multiple dataset uploads

📧 Contact

Rakesh Kapilavayi


📝 License

This project is open source and available for educational and personal use.


🙏 Acknowledgments

Built with:

  • Streamlit for the web framework
  • Plotly for interactive visualizations
  • Scikit-learn & XGBoost for machine learning
  • Pandas & NumPy for data processing

Made with ❤️ by Rakesh Kapilavayi

Happy Data Analyzing! 🚀
