Add script to calculate dataset statistics and monthly label distribution for temporal segmentation by Copilot · Pull Request #2 · collab-uniba/Monitoring_Issue_Report_Classifiers

Copilot · 2025-12-15T14:20:15Z

This PR adds a script that calculates the number of training/test months and instances from the configuration files, so that the LaTeX table for the paper submission can be populated.

Implementation

calculate_dataset_stats.py: Script that reuses DataHandler, ConfigManager, and LabelMapper to calculate statistics using the same data loading logic as the classification pipeline
- Counts unique time window files (months) for train/test splits
- Counts total instances from loaded DataFrames
- Lists all training and test month files for debugging purposes (chronologically sorted)
- Calculates monthly label distribution with cumulative statistics
- Outputs LaTeX table and optional CSV format
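
The counting steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in calculate_dataset_stats.py: the real script goes through DataHandler, ConfigManager, and LabelMapper, whose interfaces are not shown in this PR. The function names `count_months_and_instances` and `demo`, the `text,label` header, and the sample rows are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def count_months_and_instances(window_dir: Path) -> dict:
    """Count month files and data rows under a directory of per-month CSVs."""
    # Names such as 2000-01_2000-01.csv sort chronologically as plain strings
    # because year and month are zero-padded.
    month_files = sorted(window_dir.glob("*.csv"))
    instances = 0
    for path in month_files:
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            instances += sum(1 for _ in reader)
    return {
        "months": len(month_files),
        "instances": instances,
        "month_files": [p.name for p in month_files],
    }


def demo() -> dict:
    """Run the counter on a throwaway directory with made-up rows."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "2000-02_2000-02.csv").write_text("text,label\nb,1\n")
        (root / "2000-01_2000-01.csv").write_text("text,label\na,0\nc,1\n")
        return count_months_and_instances(root)
```

Calling `demo()` returns the month count, the instance count, and the sorted file names for the two temporary files, which mirrors the shape of the summary the script prints per ecosystem.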

Usage

# Process default configs (apache-m1, jira-m1, redhat-m1)
python calculate_dataset_stats.py

# Generate table for paper
python calculate_dataset_stats.py --output dataset_stats.tex

# Save statistics with month file details to CSV
python calculate_dataset_stats.py --output-csv dataset_stats.csv

# Calculate and display monthly label distribution
python calculate_dataset_stats.py --monthly-distribution

# Save monthly distribution to CSV
python calculate_dataset_stats.py --monthly-distribution --monthly-output-csv monthly_stats.csv

Output Example

Dataset Statistics Summary

\begin{table}[t]
\centering
\caption{Temporal segmentation summary for selected ecosystems.}
\label{tab:temporal-slices-summary}
\resizebox{\linewidth}{!}{%
\begin{tabular}{rrrr}
\toprule
\textbf{Ecosystem} & \textbf{Training Months} & \textbf{Test Months} & \textbf{Training Instances} \\
\midrule
Apache & \texttt{51} & \texttt{9} & \texttt{15980} \\
...
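
For reference, a row in the style shown above could be rendered with a small helper like the one below. This is only a sketch: `latex_row` is a hypothetical name, and the script's actual formatting code is not shown in this PR.

```python
def latex_row(ecosystem: str, train_months: int, test_months: int, train_instances: int) -> str:
    """Format one table row: plain ecosystem name, numeric cells in \\texttt{}."""
    numeric_cells = [rf"\texttt{{{value}}}" for value in (train_months, test_months, train_instances)]
    # Cells are joined with " & " and the row is terminated with a LaTeX line break.
    return " & ".join([ecosystem] + numeric_cells) + r" \\"


print(latex_row("Apache", 51, 9, 15980))
```

With the Apache values from the example, the helper reproduces the row shown in the table.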

The script also outputs detailed month file listings for debugging:

Apache:
  Training Months: 51
  Test Months: 9
  Training Instances: 15980
  Test Instances: 3245
  Training Month Files:
    1. 2000-01_2000-01.csv
    2. 2000-02_2000-02.csv
    ... (all 51 files listed)
  Test Month Files:
    1. 2004-04_2004-04.csv
    ... (all 9 files listed)

Monthly Label Distribution

With the --monthly-distribution flag, the script generates a detailed table showing:

- Label distribution per month
- Cumulative label counts up to each month
- Total samples per month
- Cumulative total samples up to each month

================================================================================
MONTHLY LABEL DISTRIBUTION - Apache
================================================================================
       Month |           L0 |        CumL0 |           L1 |        CumL1 |        Total |     CumTotal
------------------------------------------------------------------------------------------------------
2000-01_2000-01 |          150 |          150 |           80 |           80 |          230 |          230
2000-02_2000-02 |          120 |          270 |           95 |          175 |          215 |          445
2000-03_2000-03 |          135 |          405 |          110 |          285 |          245 |          690
... (continues for all months)
================================================================================
Legend: L0/L1 = Label counts, CumL = Cumulative label counts
Total = Total samples in month, CumTotal = Cumulative total samples

CSV output format for monthly distribution:

project,month_file,month_total,cumulative_total,label_0_count,label_0_cumulative,label_1_count,label_1_cumulative
Apache,2000-01_2000-01.csv,230,230,150,150,80,80
Apache,2000-02_2000-02.csv,215,445,120,270,95,175
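
The per-month and cumulative columns above can be derived with a running counter. The sketch below uses only the standard library and assumes the labels of each month are already available as a list; `monthly_distribution` is a hypothetical name and this is not the implementation in the script. The column names follow the CSV header shown above.

```python
from collections import Counter


def monthly_distribution(monthly_labels: dict) -> list:
    """Build per-month and cumulative label counts.

    monthly_labels maps a month file name to the list of labels in that month.
    """
    # Collect every label seen anywhere so the columns are generated dynamically.
    all_labels = sorted({label for labels in monthly_labels.values() for label in labels})
    running = Counter()
    cumulative_total = 0
    rows = []
    for month in sorted(monthly_labels):  # zero-padded names sort chronologically
        counts = Counter(monthly_labels[month])
        running.update(counts)
        month_total = sum(counts.values())
        cumulative_total += month_total
        row = {
            "month_file": month,
            "month_total": month_total,
            "cumulative_total": cumulative_total,
        }
        for label in all_labels:
            row[f"label_{label}_count"] = counts[label]
            row[f"label_{label}_cumulative"] = running[label]
        rows.append(row)
    return rows


rows = monthly_distribution({
    "2020-01_2020-01.csv": [0, 0, 1],
    "2020-02_2020-02.csv": [1, 1],
})
```

A label that is absent in a given month gets a count of zero while its cumulative value carries over, which is why the columns stay aligned for any number of labels.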

Notes

- Actual statistics require data/windows/month_range_1/{project}/ to be populated by create_time_windows.py
- Test data is loaded with _get_files_after_range(), which picks the files that chronologically follow the training period
- The script reports zeros and prints instructions when the data is unavailable
- Month files are included in the CSV export (semicolon-separated) for further analysis
- The monthly distribution feature works with any number of labels (columns are generated dynamically)
- Multiple configs are supported, with automatic filename generation for CSV output
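
To illustrate the idea of taking the test files that follow the training period, here is a small sketch. It is not the repository's _get_files_after_range(), whose signature and behavior are not shown in this PR; `files_after_range`, its `train_end` argument in YYYY-MM form, and the optional `limit` are assumptions made for the example.

```python
from typing import List, Optional


def files_after_range(filenames: List[str], train_end: str, limit: Optional[int] = None) -> List[str]:
    """Return month files whose start month is strictly after train_end (YYYY-MM)."""
    # The part before the underscore is the start month; zero-padded
    # YYYY-MM strings compare chronologically.
    later = sorted(name for name in filenames if name.split("_")[0] > train_end)
    return later if limit is None else later[:limit]


test_files = files_after_range(
    ["2004-05_2004-05.csv", "2004-03_2004-03.csv", "2004-04_2004-04.csv"],
    train_end="2004-03",
)
```

The month that closes the training range is excluded, so the first test file is the one for the following month.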

Original prompt

This section details the original issue you should resolve.

<issue_title>Save dataset stats</issue_title>
<issue_description>I need to fill this table about the dataset.

\begin{table}[t]
\centering
\caption{Temporal segmentation summary for selected ecosystems.}
\label{tab:temporal-slices-summary}
\resizebox{\linewidth}{!}{%
\begin{tabular}{rrrr}
\toprule
\textbf{Ecosystem} & \textbf{Training Months} & \textbf{Test Months} & \textbf{Training Instances} \\
\midrule
Apache & \texttt{51} & \texttt{XXX} & \texttt{XXX} \\
Jira & \texttt{59} & \texttt{XXX} & \texttt{XXX} \\
RedHat & \texttt{57} & \texttt{XXX} & \texttt{XXX} \\
\bottomrule
\end{tabular}%
}
\end{table}

I need a script that, based on the monthly configs (those with m1 in the name, e.g. config-apache-m1.yaml), calculates how many training and test months there are and how many training and test instances they contain.</issue_description>

<agent_instructions>reuse the available code if possible</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes Save dataset stats #1


Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

peppocola · 2025-12-15T14:39:58Z

I need you also to list the months you included in training for debugging reasons @copilot


Copilot · 2025-12-15T14:46:59Z

I need you also to list the months you included in training for debugging reasons @copilot

Added training and test month file listings to the output in commit 244d5a8. The script now displays:

- A numbered list of all training month files (e.g., 2000-01_2000-01.csv)
- A numbered list of all test month files

Both lists are logged during data loading and shown in the summary output. The files are sorted chronologically to make debugging easier.


peppocola · 2025-12-15T14:58:50Z

produce also a table with the distribution of labels for each month. I also need to know how many samples of each label i have till a given month. and i also need to know how many total samples i have if i consider each month till a given month. Can you add this as an output? @copilot


Copilot · 2025-12-15T15:07:23Z

produce also a table with the distribution of labels for each month. I also need to know how many samples of each label i have till a given month. and i also need to know how many total samples i have...

Added the monthly label distribution feature in commit 70f6874. Use the --monthly-distribution flag to get:

- Label distribution per month: the count of each label in each month
- Cumulative label counts: how many samples of each label exist up to each month
- Cumulative total samples: the total number of samples accumulated up to each month

Example output:

       Month |           L0 |        CumL0 |           L1 |        CumL1 |        Total |     CumTotal
2020-01_2020-01 |            2 |            2 |            1 |            1 |            3 |            3
2020-02_2020-02 |            0 |            2 |            2 |            3 |            2 |            5

Save to CSV with --monthly-output-csv filename.csv for further analysis.


Initial plan

213a0d5

Copilot AI assigned Copilot and peppocola Dec 15, 2025

Copilot started work on behalf of peppocola December 15, 2025 14:20

Copilot AI and others added 2 commits December 15, 2025 14:28

Add script to calculate dataset statistics from config files

fefdea3

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Address code review feedback - improve imports and logging

e4bee8d

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add script to save dataset statistics~~ Add script to calculate dataset statistics for temporal segmentation Dec 15, 2025

Copilot AI requested a review from peppocola December 15, 2025 14:37

Copilot finished work on behalf of peppocola December 15, 2025 14:37

Copilot started work on behalf of peppocola December 15, 2025 14:40

Add listing of training and test month files for debugging

244d5a8

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Copilot AI and others added 2 commits December 15, 2025 14:49

Fix code review issues: update docstring and CSV output

044ec92

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Use list() instead of tolist() for better compatibility

4e00d16

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Copilot finished work on behalf of peppocola December 15, 2025 14:53

Copilot started work on behalf of peppocola December 15, 2025 14:59

Add monthly label distribution with cumulative statistics

70f6874

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Copilot AI and others added 3 commits December 15, 2025 15:09

Fix legend and separator alignment for variable label counts

e700715

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Remove test config file that was accidentally committed

cf7171b

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Improve filename handling with Path for robustness

ce8574c

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>

Copilot AI changed the title ~~Add script to calculate dataset statistics for temporal segmentation~~ Add script to calculate dataset statistics and monthly label distribution for temporal segmentation Dec 15, 2025

Copilot finished work on behalf of peppocola December 15, 2025 15:13

Copilot wants to merge 10 commits into main from copilot/save-dataset-stats

2 participants
