Project for my master's thesis.

It automates insurance claim modeling in Kedro: data processing, sampling, feature engineering, feature selection, GLM/LightGBM tuning, recalibration, and validation. Everything is configurable via YAML, and the project supports generating, scheduling, and tracking hundreds of experiments. Outputs are metrics and charts. Built with Kedro, MLflow, and Python.
The dataset used in this project can be found at: Dataset Link
This script creates a Python virtual environment and installs the required dependencies.

Usage:

```shell
./create_venv.sh
```

This script sets up a PostgreSQL database for an MLflow server. It creates a database named `mlflow_db` and a user named `mlflow_user` with the specified password, and grants the user all privileges on the database.

Usage:

```shell
./setup_postgres.sh
```

The script will prompt you for the PostgreSQL password for the `mlflow_user` user.
This script starts the MLflow server for tracking experiments.

Usage:

```shell
./run_mlflow.sh
```

This script starts a JupyterLab server for interactive data analysis and development.

Usage:

```shell
./run_jupyterlab.sh
```

This script runs the Kedro pipeline defined in the project.

Usage:

```shell
./kedro_run.sh [OPTIONS]
```

Options:
- `--pipeline`, `-p`: Specify the pipeline to run (default: `__default__`).
- `--mlflow-run-id`: Continue the MLflow run with the given run ID.
1. In the `experiments` directory, copy the `experiments_dir_template` directory and rename it to your desired experiment name, e.g., `my_experiment_name`.
2. Edit the files in `experiments/my_experiment_name/templates/` (`parameters.yml` and `mlflow.yml`) to define the parameters for your experiment.
3. In any notebook in the project (e.g., `notebooks/my_experiment_analysis.ipynb`), run the following method to create a new experiment run:

   ```python
   create_experiment_run(
       experiment_name=experiment_name,
       run_name=run_name,
       template_parameters=template_parameters
   )
   ```

   where:
   - `experiment_name` is the name of your experiment directory,
   - `run_name` is the name of the new MLflow run,
   - `template_parameters` is a dictionary of parameters whose values will replace the template tags in the files located in `experiments/<experiment_name>/templates/`.

   You can call this method multiple times with different pairs of `run_name` and `template_parameters` to create multiple runs for the same experiment. Import the method in your notebook as follows:

   ```python
   from claim_modelling_kedro.experiments.experiment import (
       create_experiment_run,
       default_run_name_from_run_no
   )
   ```

4. Run the experiment using the provided `run_experiment.sh` script. See usage below.
5. Restore the default configuration files using the `restore_default_config.sh` script. See usage below.
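The `template_parameters` substitution can be pictured with a small, self-contained sketch. The `{{ name }}` tag syntax and the `render_template` helper below are illustrative assumptions, not the project's actual implementation — check `experiments_dir_template` for the real tag convention:

```python
def render_template(text: str, template_parameters: dict) -> str:
    """Replace every '{{ name }}' tag with the matching parameter value.

    The tag syntax is an assumed convention for illustration only.
    """
    for name, value in template_parameters.items():
        text = text.replace("{{ " + name + " }}", str(value))
    return text


# Hypothetical fragment of experiments/<experiment_name>/templates/parameters.yml:
template = "model: {{ model_name }}\nlearning_rate: {{ lr }}\n"
rendered = render_template(template, {"model_name": "lightgbm", "lr": 0.05})
print(rendered)  # -> model: lightgbm / learning_rate: 0.05
```

Each call to `create_experiment_run` with a different `template_parameters` dictionary produces a differently rendered configuration, which is what makes one experiment directory drive many runs.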
Useful methods for viewing experiment results in your notebook:
- From `claim_modelling_kedro.experiments.experiment`: `get_run_mlflow_id`
- From `claim_modelling_kedro.pipelines.utils.dataframes`: `load_metrics_table_from_mlflow`, `load_predictions_and_target_from_mlflow`, `load_metrics_cv_stats_from_mlflow`
- From `claim_modelling_kedro.pipelines.utils.datasets`: `get_partition`, `get_mlflow_run_id_for_partition`
This script runs an experiment for different pipelines.
It requires the experiment name and the name of the first pipeline to run as positional arguments.
Optionally, you can specify:
- another pipeline to run for all subsequent runs,
- a specific run name or multiple run names, and
- a run name from which run_experiment.sh should continue.
The script copies the rendered templates from experiments/<experiment_name>/templates/ to the Kedro configuration directory.
Usage:
```shell
./run_experiment.sh <experiment_name> <first_pipeline> [--other-pipeline OTHER_PIPELINE] [--run-name RUN_NAME [RUN_NAME ...]] [--from-run-name FROM_RUN_NAME]
```

- `<experiment_name>`: Name of the experiment directory (required)
- `<first_pipeline>`: Name of the first pipeline to run (required)
- `--other-pipeline OTHER_PIPELINE`: (optional) Name of the second pipeline to run after the first
- `--run-name RUN_NAME [RUN_NAME ...]`: (optional) One or more run names to use for the experiment(s)
- `--from-run-name FROM_RUN_NAME`: (optional) Use this run name as a template for the new run(s)
Example:

```shell
./run_experiment.sh sev_001_dummy_mean_regressor ds
```

or with additional options:

```shell
./run_experiment.sh sev_001_dummy_mean_regressor all_to_test --other-pipeline smpl_to_test --run-name my_run_1 my_run_2
```

This script restores the default configuration files for the project from `claim_modelling_kedro/conf/default/`.
Usage:
```shell
./restore_default_config.sh
```

This script allows you to delete, restore, or permanently delete (purge) an MLflow experiment.
It reads the tracking URI from `claim_modelling_kedro/conf/local/mlflow.yml`.
Supported actions:
- delete – soft-deletes the experiment (marks it as deleted)
- restore – restores a soft-deleted experiment
- purge – permanently deletes the experiment
⚠️ Requires MLflow ≥ 2.7 and a SQL backend.
Usage:
```shell
./manage_mlflow_experiment.sh <delete|restore|purge> (--name <experiment_name> | --id <experiment_id>)
```

Examples:
- Soft-delete by name: `./manage_mlflow_experiment.sh delete --name sev_001_dummy_mean_regressor`
- Restore by ID: `./manage_mlflow_experiment.sh restore --id 12`
- Permanently delete by name: `./manage_mlflow_experiment.sh purge --name sev_001_dummy_mean_regressor`
This script lists all MLflow experiments along with their name, ID, and lifecycle stage.
It uses the tracking URI configured in claim_modelling_kedro/conf/local/mlflow.yml.
Usage:
```shell
./list_mlflow_experiments.sh
```

Output includes:
- experiment name
- experiment ID
- status (active or deleted)