This project analyzes multi-subject wearable sensor data from mHealth logs to explore how physical activity levels affect acceleration and heart rate.
It includes data cleaning, feature engineering, statistical testing, and visual analytics.
Data-Analytics-Assignment-4/
│
├── CONTRIBUTORS.md
├── LICENSE
├── README.md
├── RULES.md
└── DA IA/
    ├── DA_IA_4.ipynb    <-- Main Project
    ├── activity_summary_statistics.csv
    ├── mHealth_subject1.log
    ├── mHealth_subject10.log
    ├── mHealth_subject2.log
    ├── mHealth_subject3.log
    ├── mHealth_subject4.log
    ├── mHealth_subject5.log
    ├── mHealth_subject6.log
    ├── mHealth_subject7.log
    ├── mHealth_subject8.log
    ├── mHealth_subject9.log
    ├── pairwise_comparisons.csv
    └── .ipynb_checkpoints/
        └── Untitled0-checkpoint.ipynb
- Run Block 1 to import libraries
- Run Block 2 to load all subject data files
- Expected Output: Confirmation of loaded subjects and total samples
- Displays basic dataset information
- Shows first few rows and data types
- Expected Output: DataFrame preview and statistics
- Block 4: Check and impute missing values
- Block 5: Remove null class (label 0)
- Block 6: Detect and remove outliers using IQR method
- Expected Output: Cleaning summary statistics
- Creates acceleration magnitude features
- Estimates heart rate from ECG
- Adds activity name labels
- Expected Output: New feature confirmation
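As a minimal sketch of the label-naming step, the mapping below turns numeric class labels into activity names. The label-to-name table follows the public mHealth dataset description as recalled here and should be checked against the dataset's own documentation; `ACTIVITY_NAMES`, `add_activity_names`, and the `label` and `activity` column names are illustrative assumptions, not identifiers taken from the notebook.

```python
import pandas as pd

# Assumed label table (verify against the dataset documentation).
# Label 0 is the null class and is deliberately left unmapped.
ACTIVITY_NAMES = {
    1: "Standing",
    2: "Sitting",
    3: "Lying down",
    4: "Walking",
    5: "Climbing stairs",
    6: "Waist bends forward",
    7: "Frontal elevation of arms",
    8: "Knees bending",
    9: "Cycling",
    10: "Jogging",
    11: "Running",
    12: "Jump front and back",
}


def add_activity_names(df, label_col="label"):
    """Return a copy of df with a human-readable 'activity' column."""
    out = df.copy()
    # Labels missing from the table (for example 0) become NaN.
    out["activity"] = out[label_col].map(ACTIVITY_NAMES)
    return out
```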
- Block 8: Activity distribution visualization
- Block 9: Descriptive statistics by activity
- Block 10: Acceleration comparison plots
- Block 11: Heart rate visualization
- Block 12: Correlation analysis
- Expected Output: Multiple visualizations and statistics tables
- Block 13: Setup test data
- Blocks 14-16: T-tests (High vs Low intensity)
- Blocks 17-18: ANOVA (Multiple activities)
- Block 19: Post-hoc pairwise comparisons
- Block 20: ANOVA visualizations
- Expected Output: Statistical test results with p-values
- Inter-subject variation analysis
- Subject-activity heatmaps
- Expected Output: Variation metrics and heatmaps
- Comprehensive results summary
- Data export options
- Expected Output: Summary statistics and saved files
Load 10 subject files → Combine into single DataFrame → Validate data integrity
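The loading step could look roughly like the sketch below. It assumes each log file is whitespace-delimited with 24 columns (23 sensor channels followed by the activity label); the column names in `COLUMNS` and the function name `load_subjects` are hypothetical and chosen only for illustration.

```python
from pathlib import Path

import pandas as pd

# Hypothetical column names for an assumed 24-column layout:
# chest accelerometer, two ECG leads, then ankle and arm sensors, then the label.
COLUMNS = (
    ["chest_acc_x", "chest_acc_y", "chest_acc_z", "ecg_1", "ecg_2"]
    + [f"ankle_{sensor}_{axis}" for sensor in ("acc", "gyro", "mag") for axis in "xyz"]
    + [f"arm_{sensor}_{axis}" for sensor in ("acc", "gyro", "mag") for axis in "xyz"]
    + ["label"]
)


def load_subjects(data_dir):
    """Read every mHealth_subject*.log in data_dir into one DataFrame."""
    frames = []
    for path in sorted(Path(data_dir).glob("mHealth_subject*.log")):
        df = pd.read_csv(path, sep=r"\s+", header=None, names=COLUMNS)
        # Recover the subject number from the file name.
        df["subject"] = int(path.stem.replace("mHealth_subject", ""))
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no mHealth_subject*.log files in {data_dir}")
    return pd.concat(frames, ignore_index=True)
```

Calling `load_subjects("DA IA")` from the repository root would then give one combined frame; printing `df["subject"].nunique()` and `len(df)` is one way to confirm the loaded subjects and total samples.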
Check missing values → Imputation (forward/backward fill) → Remove null class → Outlier detection (IQR method) → Remove outliers → Validate cleaned data
Outlier Detection Method:
- Uses Interquartile Range (IQR)
- Threshold: 3 × IQR
- Formula: [Q1 - 3×IQR, Q3 + 3×IQR]
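The cleaning pipeline and the IQR fence above can be sketched as follows. This is an illustration under assumptions: the function name `clean_sensor_data` is hypothetical, quartiles are computed per column over the whole dataset (the notebook may compute them differently, for example per activity), and a row is dropped if any listed feature falls outside its fence.

```python
import pandas as pd


def clean_sensor_data(df, feature_cols, label_col="label", k=3.0):
    """Impute gaps, drop the null class, and remove IQR outliers."""
    out = df.copy()
    # Forward fill first, then backward fill to cover leading gaps.
    out[feature_cols] = out[feature_cols].ffill().bfill()
    # Label 0 is the null class (no annotated activity).
    out = out[out[label_col] != 0]
    # Per-column fence: [Q1 - k*IQR, Q3 + k*IQR] with k = 3.
    q1 = out[feature_cols].quantile(0.25)
    q3 = out[feature_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inside = ((out[feature_cols] >= lower) & (out[feature_cols] <= upper)).all(axis=1)
    return out[inside].reset_index(drop=True)
```

A fence of 3 × IQR is wider than the common 1.5 × IQR rule, so only extreme readings are removed.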
Raw sensor data → Calculate magnitude:
- Chest acceleration magnitude = √(x² + y² + z²)
- Ankle acceleration magnitude = √(x² + y² + z²)
- Arm acceleration magnitude = √(x² + y² + z²)
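A short sketch of the magnitude features listed above. The column naming scheme (`chest_acc_x`, `ankle_acc_y`, and so on) and the function name `add_magnitudes` are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def add_magnitudes(df, sensors=("chest", "ankle", "arm")):
    """Add one Euclidean-norm column per accelerometer."""
    out = df.copy()
    for sensor in sensors:
        axes = out[[f"{sensor}_acc_x", f"{sensor}_acc_y", f"{sensor}_acc_z"]]
        # sqrt(x^2 + y^2 + z^2), computed row by row
        out[f"{sensor}_acc_mag"] = np.sqrt((axes ** 2).sum(axis=1))
    return out
```

For a sensor at rest the magnitude sits near gravitational acceleration regardless of orientation, which is why the magnitude is easier to compare across subjects than the individual axes.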
ECG signals β Heart rate estimation:
- Group by subject and activity
- Calculate standard deviation
- Apply scaling factor: HR = 70 + (ECG_std × 30)
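The heart rate estimate described above could be computed as in this sketch. The names `estimate_heart_rate`, `ecg_1`, `subject`, `label`, and `hr_estimate` are assumptions; the result is a proxy derived from ECG variability within each subject and activity, not a beat count from detected R-peaks.

```python
import pandas as pd


def estimate_heart_rate(df, ecg_col="ecg_1", base=70.0, scale=30.0):
    """Attach HR = base + scale * std(ECG) per (subject, activity) group."""
    out = df.copy()
    # transform() broadcasts each group's standard deviation back to its rows.
    ecg_std = out.groupby(["subject", "label"])[ecg_col].transform("std")
    out["hr_estimate"] = base + ecg_std * scale
    return out
```

Because the estimate is constant within a subject-activity group, it carries one effective observation per group, which is worth keeping in mind when it feeds a row-level statistical test.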
A. Descriptive Statistics
- Mean, standard deviation, min, max for each activity
- Sample sizes and distributions
- Coefficient of variation across subjects
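One way to produce these descriptive statistics with pandas is sketched below. The function names and the `activity` and `subject` column names are hypothetical; the coefficient of variation is taken here as the standard deviation of per-subject means divided by their mean, which is one reasonable reading of "across subjects".

```python
import pandas as pd


def describe_by_activity(df, value_col, activity_col="activity"):
    """Sample size, mean, standard deviation, min, and max per activity."""
    return df.groupby(activity_col)[value_col].agg(["count", "mean", "std", "min", "max"])


def subject_cv(df, value_col, activity_col="activity", subject_col="subject"):
    """Coefficient of variation of per-subject means, per activity."""
    subject_means = df.groupby([activity_col, subject_col])[value_col].mean()
    by_activity = subject_means.groupby(level=activity_col)
    return by_activity.std() / by_activity.mean()
```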
B. Inferential Statistics
- Independent T-tests for binary comparisons
- One-way ANOVA for multiple group comparisons
- Post-hoc pairwise tests with Bonferroni correction
Purpose: Compare means between two independent groups
Hypotheses:
- H₀ (Null): μ₁ = μ₂ (no difference in means)
- H₁ (Alternative): μ₁ ≠ μ₂ (means are different)
Test Statistic:
t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Interpretation:
- p < 0.05: Reject null hypothesis (significant difference)
- p ≥ 0.05: Fail to reject null hypothesis (no significant difference)
Effect Size (Cohen's d):
d = (μ₁ - μ₂) / σ_pooled
- |d| < 0.2: Negligible effect
- 0.2 ≤ |d| < 0.5: Small effect
- 0.5 ≤ |d| < 0.8: Medium effect
- |d| ≥ 0.8: Large effect
Our Application:
- Compare high-intensity (Jogging, Running) vs low-intensity (Sitting, Standing)
- Variables: chest acceleration magnitude, heart rate estimate
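To make the test statistic and effect size concrete, here is a small standard-library sketch. The t formula above uses separate variances for each group (the Welch form), so the sketch also computes the Welch-Satterthwaite degrees of freedom; the function names are hypothetical, and sample statistics stand in for the population symbols μ and σ.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, degrees of freedom) for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


def cohens_d(a, b):
    """Mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with its p-value; here `a` would hold the high-intensity samples and `b` the low-intensity ones.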
Purpose: Compare means across three or more groups
Hypotheses:
- H₀: μ₁ = μ₂ = μ₃ = ... = μₖ (all group means equal)
- H₁: At least one group mean differs
Test Statistic:
F = MS_between / MS_within
Interpretation:
- p < 0.05: At least one group differs significantly
- p ≥ 0.05: No significant differences between groups
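The F statistic can be computed by hand as in the sketch below, which may help when reading the ANOVA output. The function name `one_way_anova_f` is hypothetical, and `groups` is assumed to be a list of numeric samples, one per activity.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = MS_between / MS_within for a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

With SciPy available, `scipy.stats.f_oneway(*groups)` returns the same F value along with its p-value.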
Post-Hoc Tests:
- When ANOVA is significant, conduct pairwise comparisons
- Bonferroni Correction: α_adjusted = α / number_of_comparisons
- Controls for Type I error inflation in multiple comparisons
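A brief sketch of how the Bonferroni threshold follows from the number of groups. The helper names are hypothetical; the idea is simply that k groups give k(k-1)/2 pairwise comparisons, and each comparison is tested against α divided by that count.

```python
from itertools import combinations


def bonferroni_plan(groups, alpha=0.05):
    """Return all unordered pairs of groups and the adjusted alpha."""
    pairs = list(combinations(groups, 2))
    return pairs, alpha / len(pairs)


def is_significant(p_value, n_comparisons, alpha=0.05):
    """Compare a raw pairwise p-value against the Bonferroni threshold."""
    return p_value < alpha / n_comparisons
```

For five groups this yields 10 pairs and an adjusted threshold of 0.005, so a pairwise p-value of 0.01 that would pass at α = 0.05 no longer counts as significant.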
Our Application:
- Compare 5 activities: Walking, Jogging, Running, Sitting, Cycling
- Variables: chest acceleration magnitude, heart rate estimate
High-Intensity Activities (Jogging, Running):
- Expected Mean: 12-20 m/s²
- Higher variability due to dynamic movements
- Significant difference from low-intensity activities
Low-Intensity Activities (Sitting, Standing):
- Expected Mean: 9-11 m/s² (close to gravity)
- Lower variability due to minimal movement
- Primarily captures gravitational acceleration
Statistical Significance:
- T-test p-value: Expected < 0.001 (highly significant)
- Cohen's d: Expected > 0.8 (large effect size)
High-Intensity Activities:
- Expected Range: 100-140 bpm
- Reflects increased cardiovascular demand
Low-Intensity Activities:
- Expected Range: 65-80 bpm
- Near resting heart rate
Statistical Significance:
- T-test p-value: Expected < 0.01 (significant)
- Moderate to large effect size
Expected Pattern (Acceleration Magnitude):
Running > Jogging > Walking > Cycling > Sitting
F-statistic: Expected > 100 (very large)
P-value: Expected < 0.0001 (extremely significant)
Post-Hoc Comparisons:
- Running vs Sitting: Highly significant (p < 0.001)
- Jogging vs Walking: Moderately significant (p < 0.05)
- Walking vs Sitting: Significant (p < 0.01)
- p < 0.001: Very strong evidence against H₀ (***)
- p < 0.01: Strong evidence against H₀ (**)
- p < 0.05: Moderate evidence against H₀ (*)
- p ≥ 0.05: Insufficient evidence to reject H₀ (ns)
- Consider both statistical and practical significance
- Large sample sizes can yield significant p-values for trivial differences
- Always examine effect sizes and mean differences
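The p-value thresholds above map onto the conventional star notation used in results tables. The helper below is a hypothetical illustration of that mapping, not code taken from the notebook.

```python
def significance_marker(p_value):
    """Translate a p-value into the usual star notation."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"
```

Reporting the marker next to the effect size and the mean difference keeps statistical and practical significance visible side by side.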
- Box plots: Show distribution, median, quartiles, outliers
- Violin plots: Show full distribution shape
- Heatmaps: Reveal patterns across subjects and activities