Add script to calculate dataset statistics and monthly label distribution for temporal segmentation #2

Draft
Copilot wants to merge 10 commits into main from copilot/save-dataset-stats

Conversation

Copilot AI commented Dec 15, 2025

This PR adds a script that calculates the number of training/test months and instances from the configuration files, so the results can populate a LaTeX table for a paper submission.

Implementation

  • calculate_dataset_stats.py: Script that reuses DataHandler, ConfigManager, and LabelMapper to calculate statistics using the same data loading logic as the classification pipeline
    • Counts unique time window files (months) for train/test splits
    • Counts total instances from loaded DataFrames
    • Lists all training and test month files for debugging purposes (chronologically sorted)
    • Calculates monthly label distribution with cumulative statistics
    • Outputs LaTeX table and optional CSV format
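
The counting step described above can be sketched roughly as follows. This is a minimal standard-library stand-in, not the PR's code: the actual script loads data through DataHandler, whose interface is not shown here. The helper names `count_instances` and `summarize_split` are hypothetical, and the sketch assumes each month file is a CSV with one header row and one instance per remaining row.

```python
import csv
from pathlib import Path


def count_instances(csv_path: Path) -> int:
    """Count data rows in one month file (assumes a single header row)."""
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if the file has one
        return sum(1 for _ in reader)


def summarize_split(month_files: list[Path]) -> dict[str, int]:
    """Return the number of unique month files and their total instance count."""
    unique_files = sorted(set(month_files))  # a file listed twice counts once
    return {
        "months": len(unique_files),
        "instances": sum(count_instances(path) for path in unique_files),
    }
```

Calling `summarize_split` once on the training files and once on the test files would give the four numbers reported per ecosystem.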

Usage

# Process default configs (apache-m1, jira-m1, redhat-m1)
python calculate_dataset_stats.py

# Generate table for paper
python calculate_dataset_stats.py --output dataset_stats.tex

# Save statistics with month file details to CSV
python calculate_dataset_stats.py --output-csv dataset_stats.csv

# Calculate and display monthly label distribution
python calculate_dataset_stats.py --monthly-distribution

# Save monthly distribution to CSV
python calculate_dataset_stats.py --monthly-distribution --monthly-output-csv monthly_stats.csv
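
For orientation, the flags used above could be declared with argparse along these lines. Only the four flag names come from this PR description; the help texts, the defaults, and the `build_parser` name are assumptions, and the real script may define further options (for example, for selecting configs).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Calculate dataset statistics for temporal segmentation."
    )
    # Optional path for the generated LaTeX table.
    parser.add_argument("--output", default=None, help="write the LaTeX table here")
    # Optional path for the per-project statistics CSV.
    parser.add_argument("--output-csv", default=None, help="write statistics as CSV")
    # Boolean switch: no value is expected after the flag.
    parser.add_argument(
        "--monthly-distribution",
        action="store_true",
        help="compute the monthly label distribution",
    )
    # Optional path for the monthly distribution CSV.
    parser.add_argument(
        "--monthly-output-csv", default=None, help="write monthly distribution as CSV"
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse maps `--output-csv` to the attribute `output_csv`, and an exact flag match such as `--output` takes priority over prefix matching, so the two output flags do not collide.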

Output Example

Dataset Statistics Summary

\begin{table}[t]
\centering
\caption{Temporal segmentation summary for selected ecosystems.}
\label{tab:temporal-slices-summary}
\resizebox{\linewidth}{!}{%
\begin{tabular}{rrrr}
\toprule
\textbf{Ecosystem} & \textbf{Training Months} & \textbf{Test Months} & \textbf{Training Instances} \\
\midrule
Apache & \texttt{51} & \texttt{9} & \texttt{15980} \\
...
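
A table row like the one above can be produced with a small formatter. This is an illustrative sketch only; `latex_row` is a hypothetical helper name, and the script's real rendering code is not shown in this conversation.

```python
def latex_row(
    ecosystem: str, train_months: int, test_months: int, train_instances: int
) -> str:
    """Format one ecosystem as a row of the temporal segmentation table."""
    values = (train_months, test_months, train_instances)
    # Doubled braces emit literal braces inside an f-string; the raw prefix
    # keeps the backslash in \texttt from being read as an escape.
    cells = [ecosystem] + [rf"\texttt{{{value}}}" for value in values]
    return " & ".join(cells) + r" \\"


if __name__ == "__main__":
    print(latex_row("Apache", 51, 9, 15980))
```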

The script also outputs detailed month file listings for debugging:

Apache:
  Training Months: 51
  Test Months: 9
  Training Instances: 15980
  Test Instances: 3245
  Training Month Files:
    1. 2000-01_2000-01.csv
    2. 2000-02_2000-02.csv
    ... (all 51 files listed)
  Test Month Files:
    1. 2004-04_2004-04.csv
    ... (all 9 files listed)
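
One way to get a chronologically sorted, numbered listing like the one above is to parse the window boundaries out of each filename. The sketch assumes the `YYYY-MM_YYYY-MM.csv` naming seen in the example output; `window_key` and `format_listing` are hypothetical names, not functions from the PR.

```python
from pathlib import Path


def window_key(filename: str) -> tuple[int, int, int, int]:
    """Turn '2000-01_2000-01.csv' into (2000, 1, 2000, 1) for sorting."""
    start, end = Path(filename).stem.split("_")
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (start_year, start_month, end_year, end_month)


def format_listing(title: str, filenames: list[str]) -> list[str]:
    """Build the indented, numbered listing for one split."""
    ordered = sorted(filenames, key=window_key)
    lines = [f"  {title}:"]
    lines += [f"    {index}. {name}" for index, name in enumerate(ordered, start=1)]
    return lines


if __name__ == "__main__":
    files = ["2000-02_2000-02.csv", "2000-01_2000-01.csv"]
    print("\n".join(format_listing("Training Month Files", files)))
```

Parsing to integers keeps the order correct even if a filename were not zero-padded, which plain string sorting would not.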

Monthly Label Distribution

With the --monthly-distribution flag, the script generates a detailed table showing:

  • Label distribution per month
  • Cumulative label counts up to each month
  • Total samples per month
  • Cumulative total samples up to each month
================================================================================
MONTHLY LABEL DISTRIBUTION - Apache
================================================================================
       Month |           L0 |        CumL0 |           L1 |        CumL1 |        Total |     CumTotal
------------------------------------------------------------------------------------------------------
2000-01_2000-01 |          150 |          150 |           80 |           80 |          230 |          230
2000-02_2000-02 |          120 |          270 |           95 |          175 |          215 |          445
2000-03_2000-03 |          135 |          405 |          110 |          285 |          245 |          690
... (continues for all months)
================================================================================
Legend: L0/L1 = Label counts, CumL = Cumulative label counts
Total = Total samples in month, CumTotal = Cumulative total samples
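
The per-month and cumulative columns in this table amount to a running count per label. The following is a minimal sketch of that computation using only the standard library; the function name and the input shape (a mapping from month name to a list of integer labels) are assumptions, and the actual script works on the DataFrames it loads.

```python
from collections import Counter


def monthly_distribution(labels_by_month: dict[str, list[int]]) -> list[dict]:
    """Compute per-month and cumulative label counts, one row per month."""
    # Collect every label that occurs anywhere, so each row has the same columns.
    all_labels = sorted({label for labels in labels_by_month.values() for label in labels})
    running = Counter()
    cumulative_total = 0
    rows = []
    # Zero-padded month names (YYYY-MM) sort chronologically as plain strings.
    for month in sorted(labels_by_month):
        counts = Counter(labels_by_month[month])
        running.update(counts)
        month_total = sum(counts.values())
        cumulative_total += month_total
        row = {"Month": month, "Total": month_total, "CumTotal": cumulative_total}
        for label in all_labels:
            row[f"L{label}"] = counts[label]  # a Counter returns 0 for absent labels
            row[f"CumL{label}"] = running[label]
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = {"2020-01_2020-01": [0, 0, 1], "2020-02_2020-02": [1, 1]}
    for row in monthly_distribution(example):
        print(row)
```

Deriving the label set from the data is what lets the table grow extra column pairs when more than two labels are present.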

CSV output format for monthly distribution:

project,month_file,month_total,cumulative_total,label_0_count,label_0_cumulative,label_1_count,label_1_cumulative
Apache,2000-01_2000-01.csv,230,230,150,150,80,80
Apache,2000-02_2000-02.csv,215,445,120,270,95,175
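
A CSV in this layout, with one count/cumulative column pair per label, could be written roughly as below. The column names follow the header shown above; the `monthly_csv` helper and its input row format are hypothetical.

```python
import csv
import io


def monthly_csv(project: str, rows: list[dict], labels: list[int]) -> str:
    """Serialize monthly distribution rows, adding two columns per label."""
    fieldnames = ["project", "month_file", "month_total", "cumulative_total"]
    for label in labels:
        fieldnames += [f"label_{label}_count", f"label_{label}_cumulative"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # The project name is repeated on every row so files can be concatenated.
        writer.writerow({"project": project, **row})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{
        "month_file": "2000-01_2000-01.csv", "month_total": 230, "cumulative_total": 230,
        "label_0_count": 150, "label_0_cumulative": 150,
        "label_1_count": 80, "label_1_cumulative": 80,
    }]
    print(monthly_csv("Apache", sample, labels=[0, 1]), end="")
```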

Notes

  • Actual statistics require data/windows/month_range_1/{project}/ populated by create_time_windows.py
  • Test data uses _get_files_after_range() to load files chronologically after the training period
  • The script reports zeros, along with setup instructions, when the data is unavailable
  • Month files are included in CSV export (semicolon-separated) for further analysis
  • Monthly distribution feature works with any number of labels (dynamic column generation)
  • Multiple configs supported with automatic filename generation for CSV output
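
To illustrate the idea behind selecting test files that come chronologically after the training period, here is a hedged sketch. It is not the body of `_get_files_after_range()`, whose implementation is not shown in this conversation; `files_after_range` and its `train_end` parameter are hypothetical, and the sketch assumes zero-padded `YYYY-MM_YYYY-MM.csv` filenames.

```python
def files_after_range(filenames: list[str], train_end: str) -> list[str]:
    """Return files whose window starts strictly after train_end ('YYYY-MM')."""

    def start_month(name: str) -> str:
        # '2004-04_2004-04.csv' -> '2004-04'
        return name.split("_", 1)[0]

    # Zero-padded 'YYYY-MM' strings compare chronologically as plain strings.
    return sorted(name for name in filenames if start_month(name) > train_end)


if __name__ == "__main__":
    candidates = ["2004-05_2004-05.csv", "2004-03_2004-03.csv", "2004-04_2004-04.csv"]
    test_files = files_after_range(candidates, train_end="2004-03")
    # Semicolon-separated, as in the month-file column of the CSV export.
    print(";".join(test_files))
```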
Original prompt

This section details the original issue you should resolve

<issue_title>Save dataset stats</issue_title>
<issue_description>I need to fill this table about the dataset.

\begin{table}[t]
\centering
\caption{Temporal segmentation summary for selected ecosystems.}
\label{tab:temporal-slices-summary}
\resizebox{\linewidth}{!}{%
\begin{tabular}{rrrr}
\toprule
\textbf{Ecosystem} & \textbf{Training Months} & \textbf{Test Months} & \textbf{Training Instances} \\
\midrule
Apache & \texttt{51} & \texttt{XXX} & \texttt{XXX} \\
Jira & \texttt{59} & \texttt{XXX} & \texttt{XXX} \\
RedHat & \texttt{57} & \texttt{XXX} & \texttt{XXX} \\
\bottomrule
\end{tabular}%
}
\end{table}
I need a script that, based on the monthly configs (those with m1 in the name, e.g. config-apache-m1.yaml), calculates how many training and test months there are and how many training and test instances they contain.</issue_description>

<agent_instructions>reuse the available code if possible</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits December 15, 2025 14:28
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add script to save dataset statistics" to "Add script to calculate dataset statistics for temporal segmentation" Dec 15, 2025
Copilot AI requested a review from peppocola December 15, 2025 14:37
@peppocola
Contributor

I also need you to list the months you included in training, for debugging purposes. @copilot

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Author

Copilot AI commented Dec 15, 2025

I also need you to list the months you included in training, for debugging purposes. @copilot

Added training and test month file listings to the output in commit 244d5a8. The script now displays:

  • A numbered list of all training month files (e.g., 2000-01_2000-01.csv)
  • A numbered list of all test month files
  • These are logged during data loading and shown in the summary output

The files are sorted chronologically to make debugging easier.

Copilot AI and others added 2 commits December 15, 2025 14:49
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
@peppocola
Contributor

Also produce a table with the distribution of labels for each month. I also need to know how many samples of each label I have up to a given month, and how many total samples I have if I consider every month up to a given month. Can you add this as an output? @copilot

Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Author

Copilot AI commented Dec 15, 2025

Also produce a table with the distribution of labels for each month. I also need to know how many samples of each label I have up to a given month, and how many total samples I have...

Added the monthly label distribution feature in commit 70f6874. Use the --monthly-distribution flag to get:

  1. Table showing label distribution per month - displays count of each label in each month
  2. Cumulative label counts - shows how many samples of each label exist up to each month
  3. Cumulative total samples - shows total samples accumulated up to each month

Example output:

       Month |           L0 |        CumL0 |           L1 |        CumL1 |        Total |     CumTotal
2020-01_2020-01 |            2 |            2 |            1 |            1 |            3 |            3
2020-02_2020-02 |            0 |            2 |            2 |            3 |            2 |            5

Save to CSV with --monthly-output-csv filename.csv for further analysis.

Copilot AI and others added 3 commits December 15, 2025 15:09
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Co-authored-by: peppocola <45401861+peppocola@users.noreply.github.com>
Copilot AI changed the title from "Add script to calculate dataset statistics for temporal segmentation" to "Add script to calculate dataset statistics and monthly label distribution for temporal segmentation" Dec 15, 2025

Development

Successfully merging this pull request may close these issues.

Save dataset stats
