App Rating Prediction using Linear Regression
Course-end Project 1
The Google Play Store team is about to launch a new feature wherein certain promising apps will be boosted in visibility. This boost will manifest in multiple ways, including higher priority in recommendations sections (“Similar apps”, “You might also like”, “New and updated games”), and higher ranking in search results.
This feature will help bring more attention to newer apps that have the potential to succeed.
Objective:
The task is to predict app ratings based on available features so that Google can identify which apps are good candidates for promotion.
File Used: googleplaystore.csv
Fields in the data:
- App: Application name
- Category: Category to which the app belongs
- Rating: Overall user rating of the app
- Reviews: Number of user reviews for the app
- Size: Size of the app
- Installs: Number of user downloads/installs for the app
- Type: Paid or Free
- Price: Price of the app
- Content Rating: Age group the app is targeted at - Children / Mature 21+ / Adult
- Genres: An app can belong to multiple genres (apart from its main category).
- Last Updated: Date when the app was last updated on Play Store
- Current Ver: Current version of the app available on Play Store
- Android Ver: Minimum required Android version
- Load the dataset `googleplaystore.csv` using pandas.
- Identify missing values.
- Count null values per column.
- Drop rows with missing data.
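The loading and missing-value steps above can be sketched roughly as follows. The CSV text here is a tiny made-up stand-in so the example runs on its own; with the real file you would presumably call `pd.read_csv("googleplaystore.csv")` instead. The variable names (`csv_text`, `null_counts`, `df_clean`) are illustrative, not taken from the notebook.

```python
import io

import pandas as pd

# Made-up stand-in for googleplaystore.csv (assumption: same column names,
# fewer columns and rows). Replace with pd.read_csv("googleplaystore.csv").
csv_text = """App,Category,Rating,Reviews,Size,Installs,Type,Price
Alpha,TOOLS,4.1,159,19M,"10,000+",Free,0
Beta,GAME,,967,14M,"500,000+",Free,0
Gamma,FAMILY,4.7,87510,8.7M,"5,000,000+",Free,0
Delta,TOOLS,3.9,12,,"1,000+",Paid,$4.99
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count null values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value.
df_clean = df.dropna().reset_index(drop=True)
print(df_clean.shape)
```

Dropping rows is the simplest option and is what the step list describes; imputing missing ratings would be an alternative, but it is not what this project does.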
- Convert `Size` into numeric (Kb → Mb conversion).
- Convert `Reviews` to numeric.
- Clean `Installs` (remove `+` and `,`) and convert to integer.
- Clean `Price` (remove `$`) and convert to numeric.
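A minimal sketch of these conversions is shown below, using made-up rows in the raw string format. The helper `size_to_mb` is hypothetical, and two choices in it are assumptions: 1 Mb is taken as 1024 Kb (1000 would be equally defensible), and any size string that does not end in `M` or `k` is turned into `NaN`.

```python
import numpy as np
import pandas as pd

# Made-up rows in the raw string format of the dataset.
df = pd.DataFrame({
    "Size": ["19M", "512k", "8.7M"],
    "Reviews": ["159", "967", "87510"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "0"],
})


def size_to_mb(value):
    """Hypothetical helper: '19M' or '512k' -> size in Mb, else NaN."""
    text = str(value).strip()
    if text.endswith(("M", "m")):
        return float(text[:-1])
    if text.endswith(("k", "K")):
        return float(text[:-1]) / 1024  # assumption: 1 Mb = 1024 Kb
    return np.nan


df["Size"] = df["Size"].apply(size_to_mb)
df["Reviews"] = pd.to_numeric(df["Reviews"])

# Strip '+' and ',' from install counts, then cast to integer.
df["Installs"] = (
    df["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# Strip the currency symbol, then cast to float.
df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)

print(df.dtypes)
```

Passing `regex=False` matters here because `+` and `$` are regular-expression metacharacters.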
- Keep ratings only between 1 and 5.
- Ensure reviews ≤ installs.
- For free apps (`Type = Free`), price must be `0`.
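These three sanity checks could be expressed as boolean masks, as in the sketch below. The rows are made up, and three of them break one rule each on purpose; the mask names are illustrative only.

```python
import pandas as pd

# Made-up, already-numeric rows; three of them violate one rule each.
df = pd.DataFrame({
    "App": ["ok", "bad_rating", "too_many_reviews", "free_but_priced", "paid_ok"],
    "Rating": [4.2, 19.0, 4.0, 3.8, 4.5],
    "Reviews": [100, 50, 5000, 10, 20],
    "Installs": [1000, 1000, 1000, 1000, 100],
    "Type": ["Free", "Free", "Free", "Free", "Paid"],
    "Price": [0.0, 0.0, 0.0, 1.99, 2.99],
})

valid_rating = df["Rating"].between(1, 5)      # inclusive on both ends
reviews_ok = df["Reviews"] <= df["Installs"]   # cannot review without installing
free_price_ok = ~((df["Type"] == "Free") & (df["Price"] > 0))

# Keep only rows that pass all three checks.
df_checked = df[valid_rating & reviews_ok & free_price_ok].reset_index(drop=True)
print(df_checked["App"].tolist())
```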
- Boxplot for Price → Detect high outliers.
- Boxplot for Reviews → Check extremely high counts.
- Histogram for Rating → See rating distribution.
- Histogram for Size → Distribution of app sizes.
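One way to draw these four plots with matplotlib is sketched below. The data is made up, the 2×2 layout and the bin count are arbitrary choices, and the `Agg` backend line is only there so the example runs without a display; in a notebook you would likely drop it.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, already-cleaned rows for illustration only.
df = pd.DataFrame({
    "Price": [0.0, 0.0, 0.99, 4.99, 0.0, 399.99],
    "Reviews": [120, 3_500_000, 40, 900, 15, 60_000],
    "Rating": [4.1, 4.5, 3.2, 4.8, 2.9, 3.6],
    "Size": [19.0, 85.0, 3.5, 42.0, 0.5, 27.0],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Boxplots make isolated high values easy to spot.
axes[0, 0].boxplot(df["Price"].to_numpy())
axes[0, 0].set_title("Price")
axes[0, 1].boxplot(df["Reviews"].to_numpy())
axes[0, 1].set_title("Reviews")

# Histograms show the overall shape of each distribution.
axes[1, 0].hist(df["Rating"].to_numpy(), bins=10)
axes[1, 0].set_title("Rating")
axes[1, 1].hist(df["Size"].to_numpy(), bins=10)
axes[1, 1].set_title("Size")

fig.tight_layout()
```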
- Remove apps with suspiciously high prices.
- Drop apps with more than 2M reviews.
- Handle outliers in Installs using percentile thresholds.
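The outlier rules above might look like the following sketch. Only the 2M review limit comes from the step list; the price cutoff (`PRICE_CAP`) and the percentile (`INSTALLS_PCT`) are hypothetical values chosen for illustration, since no specific numbers are stated.

```python
import pandas as pd

# Made-up, already-numeric rows for illustration only.
df = pd.DataFrame({
    "Price": [0.0, 0.99, 399.99, 0.0, 4.99, 0.0],
    "Reviews": [120, 3_500_000, 40, 900, 15, 60_000],
    "Installs": [10_000, 1_000_000_000, 1_000, 50_000, 500, 5_000_000],
})

PRICE_CAP = 200          # hypothetical cutoff for "suspiciously high" prices
MAX_REVIEWS = 2_000_000  # from the step list: drop apps with more than 2M reviews
INSTALLS_PCT = 0.95      # hypothetical percentile threshold for Installs

df = df[df["Price"] <= PRICE_CAP]
df = df[df["Reviews"] <= MAX_REVIEWS]

# Compute the percentile on the remaining rows, then drop everything above it.
installs_cap = df["Installs"].quantile(INSTALLS_PCT)
df = df[df["Installs"] <= installs_cap].reset_index(drop=True)

print(df)
```

The percentile is computed after the price and review filters, so its value depends on the order of the steps; a different order may give a different cutoff.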
- Scatter plots: Rating vs Price, Size, Reviews.
- Boxplots: Rating vs Content Rating, Rating vs Category.
- Interpret relationships and patterns.
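A possible matplotlib sketch of the bivariate plots follows. The data and category labels are made up, and `rating_boxplot` is a hypothetical helper introduced here so the same code serves both `Content Rating` and `Category`.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, already-cleaned rows; the category labels are illustrative.
df = pd.DataFrame({
    "Rating": [4.1, 4.5, 3.2, 4.8, 2.9, 3.6],
    "Price": [0.0, 0.0, 0.99, 4.99, 0.0, 1.99],
    "Size": [19.0, 85.0, 3.5, 42.0, 0.5, 27.0],
    "Reviews": [120, 35_000, 40, 900, 15, 60_000],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Teen", "Everyone", "Teen"],
    "Category": ["TOOLS", "GAME", "TOOLS", "GAME", "FAMILY", "FAMILY"],
})

# Scatter plots: Rating against each numeric feature.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["Price", "Size", "Reviews"]):
    ax.scatter(df[col], df["Rating"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("Rating")


def rating_boxplot(frame, by):
    """Hypothetical helper: one Rating box per level of the column `by`."""
    labels, data = [], []
    for label, values in frame.groupby(by)["Rating"]:
        labels.append(label)
        data.append(values.to_numpy())
    box_fig, box_ax = plt.subplots()
    box_ax.boxplot(data)
    box_ax.set_xticks(range(1, len(labels) + 1))
    box_ax.set_xticklabels(labels)
    box_ax.set_xlabel(by)
    box_ax.set_ylabel("Rating")
    return box_fig, box_ax


fig_cr, ax_cr = rating_boxplot(df, "Content Rating")
fig_cat, ax_cat = rating_boxplot(df, "Category")
```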
- Apply log transformation (`np.log1p`) to `Reviews` & `Installs`.
- Drop unused columns: `App`, `Last Updated`, `Current Ver`, `Android Ver`.
- Convert categorical variables (`Category`, `Genres`, `Content Rating`, `Type`) into dummy variables.
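The preprocessing steps above can be sketched as below on a few made-up rows. Using `drop_first=True` in `pd.get_dummies` is an assumption made here to avoid redundant dummy columns; the step list does not say whether the notebook does this.

```python
import numpy as np
import pandas as pd

# Made-up, already-cleaned rows for illustration only.
df = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Category": ["TOOLS", "GAME", "TOOLS"],
    "Rating": [4.1, 3.9, 4.6],
    "Reviews": [0, 99, 9999],
    "Installs": [100, 10_000, 1_000_000],
    "Type": ["Free", "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
    "Genres": ["Tools", "Action", "Tools"],
    "Last Updated": ["2018-01-07", "2018-06-20", "2018-08-01"],
    "Current Ver": ["1.0", "2.3", "4.1"],
    "Android Ver": ["4.0 and up", "4.4 and up", "5.0 and up"],
})

# log1p(x) = log(1 + x), so a count of 0 maps to 0 instead of -inf.
for col in ["Reviews", "Installs"]:
    df[col] = np.log1p(df[col])

# Drop columns that are not used as model features.
df = df.drop(columns=["App", "Last Updated", "Current Ver", "Android Ver"])

# One-hot encode the categorical columns (drop_first is an assumption).
df = pd.get_dummies(
    df,
    columns=["Category", "Genres", "Content Rating", "Type"],
    drop_first=True,
)

print(df.columns.tolist())
```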
- Perform 70-30 split into `df_train` and `df_test`.
- Create `X_train`, `y_train`, `X_test`, and `y_test`.
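A minimal sketch of the split and the feature/target separation, assuming scikit-learn's `train_test_split` is used. The data is randomly generated for illustration, and the `random_state` value is an arbitrary choice that only makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the preprocessed data (illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Rating": rng.uniform(1, 5, size=100).round(1),
    "Reviews": rng.uniform(0, 12, size=100),
    "Installs": rng.uniform(0, 18, size=100),
    "Price": rng.uniform(0, 10, size=100),
})

# 70-30 split; random_state is an arbitrary value for reproducibility.
df_train, df_test = train_test_split(df, train_size=0.7, random_state=100)

# Separate the target (Rating) from the features.
y_train = df_train["Rating"]
X_train = df_train.drop(columns="Rating")
y_test = df_test["Rating"]
X_test = df_test.drop(columns="Rating")

print(X_train.shape, X_test.shape)
```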
- Train Linear Regression model.
- Report R² score on the training set.
- Make predictions on the test set.
- Report R² score on test data.
- Interpret results.
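The training and evaluation steps can be sketched with scikit-learn as follows. The features and target are synthetic, so the printed R² values will not match the scores reported for the real dataset; the coefficients used to generate `y` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a noisy linear target (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 + 0.3 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100
)

# Fit ordinary least squares on the training set.
model = LinearRegression()
model.fit(X_train, y_train)

# R² on the training set.
r2_train = r2_score(y_train, model.predict(X_train))

# Predict on the held-out test set and score those predictions.
y_pred = model.predict(X_test)
r2_test = r2_score(y_test, y_pred)

print(f"R2 train: {r2_train:.4f}, R2 test: {r2_test:.4f}")
```

A test R² noticeably below the training R² suggests some overfitting, while low values on both sets suggest the features simply carry little linear signal about the target.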
- R² on Training Set: ~0.1662
- R² on Test Set: ~0.1295
The model explains only about 17% of the variance in ratings on the training set and about 13% on the test set, so additional features or more flexible models are likely needed for better prediction.
- Data cleaning and preprocessing are essential before modeling.
- Linear regression provides baseline performance but may not be sufficient for complex patterns.
- This project demonstrates end-to-end data preprocessing, visualization, and regression modeling on real-world app store data.
The repository contains the following files and folders:
- `notebooks/App_Rating.ipynb`: Complete step-by-step notebook covering data cleaning, EDA, outlier treatment, preprocessing, model building and evaluation.
- `googleplaystore.csv`: Original dataset used for the project. (If the file is large or private, you may include a smaller sample here and add download instructions.)
- `README.md`: Project overview, problem statement, steps performed, results and instructions to run the project.
- `requirements.txt`: List of Python packages needed to run the notebook (use `pip install -r requirements.txt`).
- `.gitignore`: Patterns for files that should not be committed (virtual envs, dataset if you choose to keep it private, notebook checkpoints, etc.).