Statistical analysis of feature relevance using the Likelihood Ratio Test applied to the Wisconsin Breast Cancer Dataset.
This repository presents a statistical study based on the Likelihood Ratio Test (LRT) to identify the most relevant features for distinguishing malignant from benign tumors.
The project combines theoretical foundations of hypothesis testing with a practical application to medical data, focusing on nested model comparison and feature significance.
The objectives of this work are to:
- Introduce the theoretical principles of the Likelihood Ratio Test
- Apply LRTs to a real-world medical dataset
- Identify tumor characteristics that significantly improve malignancy inference
- Illustrate the connection between statistical theory and applied data analysis
The analysis follows the classical Likelihood Ratio Testing framework:
- Definition of null and alternative hypotheses
- Construction of reduced and full logistic regression models
- Computation of maximized log-likelihoods
- Derivation of the LRT statistic
- Statistical decision based on asymptotic theory (Wilks’ theorem)
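Each feature is tested by comparing a reduced model that excludes the feature (the null model) with a full model that includes it. Because the two models differ by a single parameter, the LRT statistic is asymptotically chi-squared with one degree of freedom under the null hypothesis.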
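The steps above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration and not the repository's implementation. It assumes the simplest nested pair (an intercept-only reduced model versus an intercept plus one feature), runs on synthetic data rather than the real dataset, and uses hypothetical function names (`log_likelihood`, `fit_full`, `lrt_single_feature`). The statistic is twice the gap between the maximized log-likelihoods, and the p-value comes from the chi-squared distribution with one degree of freedom.

```python
import math
import random


def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the logistic model logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # y*z - log(1 + exp(z)), written in a numerically stable form
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def fit_full(xs, ys, max_iter=50, tol=1e-10):
    """Maximum-likelihood fit of intercept and slope by Newton-Raphson.

    Assumes the classes are not perfectly separable by x (otherwise the
    MLE does not exist and the iteration diverges).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # score, intercept component
            g1 += (y - p) * x      # score, slope component
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def lrt_single_feature(xs, ys):
    """Return (statistic, p_value) for H0: slope = 0 against H1: slope != 0."""
    # Reduced model: intercept only; its MLE is the logit of the class proportion.
    p_bar = sum(ys) / len(ys)
    ll_reduced = log_likelihood(math.log(p_bar / (1.0 - p_bar)), 0.0, xs, ys)
    # Full model: intercept plus the feature under test.
    b0, b1 = fit_full(xs, ys)
    ll_full = log_likelihood(b0, b1, xs, ys)
    stat = max(0.0, 2.0 * (ll_full - ll_reduced))
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Synthetic illustration: one informative feature and one pure-noise feature.
random.seed(0)
n = 300
signal = [random.gauss(0.0, 1.0) for _ in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
labels = [
    1 if random.random() < 1.0 / (1.0 + math.exp(-(-0.5 + 1.5 * x))) else 0
    for x in signal
]

stat_signal, p_signal = lrt_single_feature(signal, labels)
stat_noise, p_noise = lrt_single_feature(noise, labels)
print(f"informative feature: LRT = {stat_signal:.2f}, p = {p_signal:.3g}")
print(f"noise feature:       LRT = {stat_noise:.2f}, p = {p_noise:.3g}")
```

When the reduced model already contains other covariates, the same comparison applies, but both models then need a general-purpose logistic regression fit (for example from a statistics library) in place of the two-parameter Newton step used here.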
- Dataset: Wisconsin Breast Cancer (Diagnostic) Dataset
- Observations: 569 tumor samples
- Labels: Malignant / Benign
- Features: 30 continuous variables describing tumor morphology and texture
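For readers who want to check these figures, the sketch below loads a copy of the dataset. It assumes scikit-learn is installed and uses its bundled version via `load_breast_cancer`; the repository itself may read the data from a different source. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the Wisconsin Breast Cancer (Diagnostic) data.
data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape
feature_names = [str(name) for name in data.feature_names]

# Count observations per class, keyed by the label names shipped with the data.
class_counts = {
    str(name): int((y == code).sum())
    for code, name in enumerate(data.target_names)
}

print(n_samples, n_features)
print(class_counts)
print(feature_names[:5])
```

In this copy the target is coded 0 for malignant and 1 for benign, so a model of malignancy probability may need the labels flipped (for example `1 - y`) before fitting.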
The figure below illustrates the distribution of the texture (mean) feature for malignant and benign tumors, showing a visible shift between the two classes.
This visual evidence motivates the statistical testing of texture-related features using the Likelihood Ratio Test.
The table below summarizes the Likelihood Ratio Test statistics and corresponding p-values for selected tumor features.
A low p-value is strong evidence against the null hypothesis that the feature adds no information to the model; such features are considered statistically significant contributors to malignancy prediction.
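As a reading aid for the statistics and p-values, the relation between the two can be written out explicitly. The notation here is an assumption for illustration: Λ denotes the LRT statistic and ℓ the maximized log-likelihood, and each test is taken to add a single parameter so that Wilks' theorem gives a chi-squared reference distribution with one degree of freedom.

```math
\Lambda = 2\left[\ell(\hat\theta_{\text{full}}) - \ell(\hat\theta_{\text{reduced}})\right],
\qquad
p = \Pr\left(\chi^2_1 \ge \Lambda\right)
```

```math
\Lambda > 3.84 \;\Rightarrow\; p < 0.05,
\qquad
\Lambda > 6.63 \;\Rightarrow\; p < 0.01
```

For instance, a hypothetical statistic of Λ = 10 (not a value taken from the table) would correspond to p ≈ 0.0016, well below either threshold.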