Hello! I'm Rakesh Kapilavayi, and I created DataPulse to help you explore and understand your data in a simple way. This tool is a web app that lets you:
- Upload a dataset (CSV or Excel)
- Clean it manually or automatically
- Create visualizations
- Run machine learning models
- Generate professional insights and recommendations
Whether you're new to data or an expert, this app makes data analysis fun and easy!
Try it out here: Click Here
- Add a CSV or Excel file to start.
- Number of rows and columns
- Missing values
- Duplicate rows
- Data types and unique values
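The overview figures listed above can all be read off a pandas DataFrame. The following is a minimal sketch, not DataPulse's actual code: the function name `dataset_overview` and the sample columns `city` and `sales` are made up for illustration.

```python
import pandas as pd


def dataset_overview(df: pd.DataFrame) -> dict:
    """Collect row/column counts, missing values, duplicates, dtypes and unique counts."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": df.isna().sum().to_dict(),      # per-column count of NaN/None
        "duplicate_rows": int(df.duplicated().sum()),     # rows identical to an earlier row
        "dtypes": df.dtypes.astype(str).to_dict(),
        "unique_values": df.nunique().to_dict(),          # NaN is not counted as a value
    }


# Hypothetical sample data, only to show the shape of the result.
sample = pd.DataFrame(
    {"city": ["Pune", "Pune", "Delhi", None], "sales": [10.0, 10.0, 7.5, 3.0]}
)
overview = dataset_overview(sample)
print(overview)
```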
- Choose how to handle missing values (mean, median, mode, drop rows, etc.)
- Delete duplicates
- Checkbox turns light blue when selected
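For readers curious what the per-column strategies amount to, here is a rough pandas sketch. It is an illustration under assumptions, not the app's implementation: the helper name `fill_missing` and the strategy strings are hypothetical.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame, column: str, strategy: str) -> pd.DataFrame:
    """Return a copy of df with missing values in `column` handled by `strategy`."""
    out = df.copy()
    if strategy == "drop":
        # Drop every row where this column is missing.
        return out.dropna(subset=[column])
    if strategy == "mean":
        value = out[column].mean()
    elif strategy == "median":
        value = out[column].median()
    elif strategy == "mode":
        # Assumes the column has at least one non-missing value.
        value = out[column].mode().iloc[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    out[column] = out[column].fillna(value)
    return out


# Hypothetical sample data.
df = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["A", "A", None]})
by_mean = fill_missing(df, "age", "mean")
by_mode = fill_missing(df, "city", "mode")
dropped = fill_missing(df, "age", "drop")
```

Mean and median only make sense for numeric columns; mode works for categorical columns as well.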
- Automatically:
  - Handle missing values
  - Remove duplicates
  - Cap outliers using the IQR method
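The three automatic steps can be sketched in a few lines of pandas. This is an assumption-laden illustration rather than the code in `functions.py`: the function names, the 1.5 multiplier (the conventional IQR fence), and the choice of median/mode for filling are all assumptions.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def auto_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicate rows, fill missing values, then cap numeric outliers."""
    out = df.drop_duplicates().copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]) and not is_bool_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())  # assumed: median for numbers
            out[col] = cap_outliers_iqr(out[col])
        else:
            mode = out[col].mode()
            if not mode.empty:                             # assumed: mode for categories
                out[col] = out[col].fillna(mode.iloc[0])
    return out


capped = cap_outliers_iqr(pd.Series([1.0, 2.0, 3.0, 4.0, 100.0]))

# Hypothetical sample with one duplicate row, one missing row and one outlier.
raw = pd.DataFrame(
    {
        "x": [1.0, 2.0, 2.0, None, 3.0, 4.0, 100.0],
        "g": ["a", "b", "b", None, "a", "a", "a"],
    }
)
cleaned = auto_clean(raw)
```

Capping (rather than dropping) keeps the row count stable, which is convenient when other columns in the same row are still valid.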
- Histograms for numerical columns
- Scatter plot for the two most correlated numeric columns
- Heatmap showing correlation matrix
- Bar charts for categorical columns
- Box plots to detect outliers
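Picking "the two most correlated numeric columns" for the scatter plot can be done from the correlation matrix. The sketch below is one possible approach and assumes absolute Pearson correlation; the helper name `most_correlated_pair` is hypothetical and the app may select the pair differently.

```python
import numpy as np
import pandas as pd


def most_correlated_pair(df: pd.DataFrame):
    """Return the two numeric column names with the highest absolute correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    if corr.shape[0] < 2:
        return None
    values = corr.to_numpy(copy=True)
    np.fill_diagonal(values, np.nan)      # ignore each column's correlation with itself
    if np.isnan(values).all():
        return None
    i, j = np.unravel_index(np.nanargmax(values), values.shape)
    return corr.index[i], corr.columns[j]


# Hypothetical sample: b is exactly 2 * a, c is unrelated.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3], "label": list("vwxyz")}
)
pair = most_correlated_pair(df)
# With Plotly installed, the pair could then be plotted, for example:
#   import plotly.express as px
#   px.scatter(df, x=pair[0], y=pair[1])
```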
- Classification (predict categories)
- Regression (predict numbers)
- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost
- SVM (Support Vector Machine)
Features:
- Auto-handles categorical variables
- View accuracy metrics and visual reports
- Confusion matrix for classification
- Feature importance rankings
- Cross-validation scores
- Optional hyperparameter tuning (can improve model performance, but training takes longer)
- Checkbox turns light blue when selected
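To make the features above concrete, here is a small scikit-learn sketch of encoding a categorical column, scoring with cross-validation, tuning with a grid search, and ranking feature importances. It is not taken from `machinelearning.py`: the synthetic data, the `segment` column, the Random Forest choice, and the grid values are all assumptions picked for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data standing in for an uploaded dataset.
X_arr, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
X["segment"] = ["a", "b"] * 100                 # hypothetical categorical column
X = pd.get_dummies(X, columns=["segment"])      # one-hot encode categoricals

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation accuracy for the untuned model.
cv_scores = cross_val_score(model, X, y, cv=5)

# Optional tuning: tries every combination in the grid, so it is slower.
search = GridSearchCV(
    model,
    param_grid={"max_depth": [3, None], "n_estimators": [25, 50]},
    cv=3,
)
search.fit(X, y)

# Feature importance ranking from the best model found.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(cv_scores.mean(), search.best_params_)
print(importances)
```

The grid search fits one model per parameter combination per fold, which is why tuning trades speed for performance.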
- Enhanced Insights: Get comprehensive, professionally formatted analysis reports including:
  - Dataset overview and characteristics
  - Key statistical findings and patterns
  - Correlation analysis and relationships
  - Data quality assessment
  - Strategic recommendations for next steps
  - Business impact interpretation
- Quick Summary: Get instant statistical observations
- Raw Statistical Data: Access detailed technical metrics
The insights combine statistical analysis with professional formatting to help you:
- Understand missing value patterns
- Identify important correlations
- Assess data quality
- Get actionable recommendations for further analysis
- Download the cleaned dataset as a CSV file
- Export insights reports as text files
- A computer with Python 3.8 or higher
- A web browser (Chrome, Firefox, Safari, Edge, etc.)
- Internet connection (for generating enhanced insights)
Download or clone the project:
git clone https://github.com/rakeshkapilavayi/DataPulse-Automated-EDA.git
cd DataPulse-Automated-EDA
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
This will install:
- Streamlit (web framework)
- Pandas (data manipulation)
- NumPy (numerical operations)
- Plotly (interactive visualizations)
- Scikit-learn (machine learning)
- XGBoost (advanced ML models)
- Scipy (statistical functions)
- Openpyxl (Excel support)
- Google Generative AI (for enhanced insights)
Place a logo.png file inside the project folder.
Don't have a logo? Remove or comment out the st.sidebar.image() line in interface.py.
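As an alternative to editing the file each time, the logo call could be guarded so it only runs when the image exists. This is a sketch of that idea, not code from the repository: the helper `find_logo` is hypothetical, and the commented lines show how it might wrap the existing `st.sidebar.image()` call.

```python
from pathlib import Path
from typing import Optional


def find_logo(project_dir: str = ".", name: str = "logo.png") -> Optional[Path]:
    """Return the logo path if the file exists in project_dir, otherwise None."""
    candidate = Path(project_dir) / name
    return candidate if candidate.is_file() else None


# In interface.py the existing call could then become, for example:
#   logo = find_logo()
#   if logo is not None:
#       st.sidebar.image(str(logo))
```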
streamlit run interface.py
It will usually open at: http://localhost:8501
- Click "Choose a CSV or Excel file".
- You'll see a preview of the dataset.
- Uploading a new file resets the app.
- Basic stats on rows, columns, missing data, duplicates
- Column information with data types
- Unique value counts
- Fix missing data by choosing strategies per column
- Remove duplicates manually
- Full control over cleaning decisions
- Clean your dataset automatically with one click
- See a detailed cleaning report:
  - Which columns were handled
  - How many duplicates were removed
  - Which outliers were capped
- Visualize your data with interactive charts:
  - Distribution histograms
  - Correlation scatter plots (optimized for large datasets)
  - Correlation heatmaps
  - Categorical distribution bar charts
- Detect outliers using box plots for all numerical columns
- Visual identification of extreme values
- Choose task: Classification or Regression
- Select target column from appropriate data types
- Pick a model from the dropdown
- Optional: Enable hyperparameter tuning for better performance
- Click Train Model to view:
  - Evaluation metrics
  - Confusion matrix (classification)
  - Feature importance rankings
  - Cross-validation scores
- Generate Enhanced Insights: Get a comprehensive, professionally formatted analysis report covering:
  - Dataset characteristics and overview
  - Key statistical findings and their implications
  - Important correlations and relationships
  - Data quality assessment
  - Actionable recommendations
  - Business impact interpretation
- Generate Quick Summary: Get instant bullet-point insights
- View Raw Statistical Data: Access detailed technical metrics in an expandable section
- Download the cleaned dataset as a CSV file
- Download insights reports as text files
Here's what's inside the project folder:
DataPulse-Automated-EDA/
│
├── interface.py # Main Streamlit application
├── functions.py # Data cleaning, EDA, and insights logic
├── machinelearning.py # ML models and training pipelines
├── llm_insights.py # Enhanced insights generation module
├── requirements.txt # Python dependencies
├── eda_app.log # Application log file (auto-generated)
├── logo.png # Optional sidebar logo
└── README.md # Project documentation
- ✅ Easy to Use: Clean, intuitive interface for all skill levels
- ✅ Comprehensive EDA: Multiple visualization types with interactive charts
- ✅ Smart Cleaning: Automated and manual data cleaning options
- ✅ ML Ready: Train models with just a few clicks
- ✅ Professional Insights: Get formatted, actionable analysis reports
- ✅ Export Everything: Download cleaned data and insights
- ✅ Performance Optimized: Handles large datasets efficiently
- ✅ Session Management: Properly handles multiple dataset uploads
Rakesh Kapilavayi
- Email: [email protected]
- LinkedIn: Rakesh Kapilavayi
- GitHub: rakeshkapilavayi
This project is open source and available for educational and personal use.
Built with:
- Streamlit for the web framework
- Plotly for interactive visualizations
- Scikit-learn & XGBoost for machine learning
- Pandas & NumPy for data processing
Made with ❤️ by Rakesh Kapilavayi
Happy Data Analyzing!