fix: dataset handling improvements #117

marcellodebernardi · 2025-05-17T22:16:13Z

This PR makes three changes to plexe:

It adds a "data analyser" agent that carries out statistical analysis on the input dataset and generates an "EDA report", which the other agents have access to.
It improves the dataset splitting tool to handle time series data correctly.
It makes several miscellaneous improvements, such as file clean-up after training and COT summaries.
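
To illustrate what handling time series data correctly usually means for a split, here is a minimal sketch of a chronological split. It is not the code from this PR: `chronological_split`, its parameters, and the list-of-dicts input are hypothetical stand-ins for whatever `TabularDataset.split()` and `split_datasets` actually do. The idea is to sort by time and cut, instead of shuffling, so that no later row ends up in the training set.

```python
from typing import Any, Sequence

Row = dict[str, Any]


def chronological_split(
    rows: Sequence[Row],
    time_column: str,
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
) -> tuple[list[Row], list[Row], list[Row]]:
    """Split rows into train/validation/test sets in time order.

    Hypothetical helper for illustration; not part of plexe.
    """
    if train_ratio <= 0 or val_ratio < 0 or train_ratio + val_ratio >= 1:
        raise ValueError("ratios must leave a non-empty share for the test set")

    # Sort by timestamp instead of shuffling, so the model is never
    # trained on rows that come after the ones it is evaluated on.
    ordered = sorted(rows, key=lambda row: row[time_column])

    train_end = int(len(ordered) * train_ratio)
    val_end = train_end + int(len(ordered) * val_ratio)

    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Usage: ten rows supplied out of order.
rows = [{"day": d, "value": d * 10} for d in (5, 1, 9, 3, 7, 2, 8, 4, 10, 6)]
train, val, test = chronological_split(rows, time_column="day")
```

Every row in `train` precedes every row in `val`, which in turn precedes every row in `test`; a random split gives no such guarantee.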

Copilot

Pull Request Overview

This PR enhances dataset handling by introducing an EDA agent, improving time series splits, and applying miscellaneous clean-up and prompt updates.

Added a new EDA agent with register/get EDA report tools and integrated it into the build pipeline
Updated dataset splitting (both in TabularDataset and split_datasets) to support chronological time series
Performed minor clean-ups: bumped version, cleaned temp files in MLflow callback, refined COT summarization prompt

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

File	Description
pyproject.toml	Bump version to 0.20.0
plexe/templates/prompts/utils/cot_summarize.jinja	Refined COT summary instructions
plexe/templates/prompts/planning/generate.jinja	Removed obsolete planning generate template
plexe/templates/prompts/agent/schema_resolver_prompt.jinja	Updated to use EDA tools in schema resolution
plexe/templates/prompts/agent/mls_prompt_templates.yaml	Added guidance for using get_eda_report
plexe/templates/prompts/agent/eda_prompt_templates.yaml	New prompts for the EDA agent
plexe/agents/dataset_analyser.py	New EdaAgent implementation
plexe/models.py	Integrated EdaAgent run and stored EDA metadata
plexe/internal/models/tools/datasets.py	Added is_time_series split and EDA report tools
plexe/internal/common/datasets/tabular.py	Extended split() to handle time series splits
plexe/internal/models/callbacks/mlflow.py	Clean up temp files after logging artifacts
plexe/internal/common/utils/chain_of_thought/emitters.py	Updated color mapping for new agents
plexe/internal/agents.py	Registered get_eda_report for tool-calling agents
plexe/config.py	Added prompt rendering for the EDA agent
docs/architecture/multi-agent-system.md	Documented EDA Agent in architecture

Comments suppressed due to low confidence (7)

plexe/templates/prompts/utils/cot_summarize.jinja:1

[nitpick] The old '-1.' and '-2.' list items overlap with the newly added numbered instructions, which may confuse template users. Consider removing or renumbering the obsolete list to keep the instruction set clear.

  -1. A clear, professional title (3-8 words) that captures the essence of what happened

plexe/internal/models/tools/datasets.py:211

The '@tool' decorator is used but not imported in this file. Add the appropriate import (e.g., 'from smolagents import tool') at the top to avoid NameError.

@tool

plexe/internal/models/tools/datasets.py:239

The code references 'ObjectRegistry' but it is not imported here. Add 'from plexe.internal.common.registries.objects import ObjectRegistry' at the top of the file.

    object_registry = ObjectRegistry()

plexe/internal/models/tools/datasets.py:211

Logger methods are called (e.g., logger.debug) but 'logger' is not defined. Add 'import logging' and 'logger = logging.getLogger(__name__)' after the imports.

@tool
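
For reference, the conventional way to define a module-level logger with the standard library is sketched below. `register_report` is a hypothetical function added only to show the logger in use; it is not the tool defined in `datasets.py`.

```python
import logging

# One logger per module, named after the module, so log records
# can be filtered by the package hierarchy.
logger = logging.getLogger(__name__)


def register_report(dataset_name: str) -> str:
    """Hypothetical function; only demonstrates calling the logger."""
    logger.debug("Registering EDA report for dataset %s", dataset_name)
    return f"eda_report_{dataset_name}"


key = register_report("sales")
```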

plexe/internal/models/tools/datasets.py:264

The get_eda_report function also uses '@tool', ObjectRegistry, and logger without importing them. Ensure you import 'tool', 'ObjectRegistry', and define 'logger' at the top of this file.

@tool

plexe/models.py:387

Storing all EDA reports under the same 'eda_report' key will overwrite previous entries. Use a unique metadata key per dataset (e.g., f"eda_report_{name}") or store them in a dict/list.

                self.metadata["eda_report"] = self.object_registry.get(dict, f"eda_report_{name}")
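
One way to act on this suggestion is to collect the reports into a dict keyed by dataset name. The sketch below uses a plain dict as a stand-in for the object registry; `collect_eda_reports` and the `eda_reports` metadata key are hypothetical names, not plexe's API.

```python
from typing import Any

Report = dict[str, Any]


def collect_eda_reports(
    registry: dict[str, Report], dataset_names: list[str]
) -> dict[str, Report]:
    """Gather one report per dataset without overwriting earlier entries.

    `registry` is a plain-dict stand-in for the object registry (assumption).
    """
    reports: dict[str, Report] = {}
    for name in dataset_names:
        report = registry.get(f"eda_report_{name}")
        if report is not None:  # skip datasets that have no report yet
            reports[name] = report
    return reports


# Usage with two datasets: both reports survive, each under its own name.
registry = {
    "eda_report_sales": {"rows": 1000},
    "eda_report_customers": {"rows": 250},
}
metadata: dict[str, Any] = {}
metadata["eda_reports"] = collect_eda_reports(registry, ["sales", "customers", "missing"])
```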

plexe/internal/models/callbacks/mlflow.py:152

The 'Path' class is used but not imported. Add 'from pathlib import Path' to the imports to avoid NameError during cleanup.

                    Path("trainer_source.py").unlink(missing_ok=True)
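
A minimal sketch of this clean-up pattern, with the import in place, follows. `remove_temp_files` is a hypothetical helper and the temporary directory exists only to keep the example self-contained. `Path.unlink(missing_ok=True)` requires Python 3.8 or later.

```python
import tempfile
from pathlib import Path


def remove_temp_files(paths: list[Path]) -> list[Path]:
    """Delete each file if present and return the ones that were removed."""
    removed = []
    for path in paths:
        existed = path.exists()
        # missing_ok=True makes the call a no-op when the file is already gone.
        path.unlink(missing_ok=True)
        if existed:
            removed.append(path)
    return removed


# Usage: one file that exists and one that was never written.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "trainer_source.py"
    source.write_text("print('training')\n")
    never_written = Path(tmp) / "never_written.py"

    removed = remove_temp_files([source, never_written])
    still_exists = source.exists()
```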

marcellodebernardi added 14 commits May 13, 2025 22:08

fix: trainer_source.py not cleaned up

5ea3b73

fix: handle dataset splitting for chronological data

4b2a668

fix: switch to codeagent for schema resolver

819aa39

feat: add data analyser agent

0e8979a

feat: add data analyser agent

1113e69

fix: put eda report as dict in metadata

0d8df15

feat: update multi-agent-system.md

df517a3

chore: bump to 0.20.0

05b8684

fix: misc improvements to dataset analyser

684cda3

fix: eda agent using wrong prompt template

f2023ba

chore: remove unused prompt template

b20dbbe

chore: remove unused plan generation template

cc22569

fix: emitter agent colors defined incorrectly

b6eef55

feat: make chain of thought summaries follow t/a/o structure

b94c721

marcellodebernardi requested a review from vaibs-d May 17, 2025 22:16

plexe-ai deleted a comment from jazzberry-ai bot May 17, 2025

marcellodebernardi requested a review from Copilot May 17, 2025 22:16

marcellodebernardi marked this pull request as ready for review May 17, 2025 22:16

Copilot AI reviewed May 17, 2025

vaibs-d approved these changes May 17, 2025

vaibs-d merged commit de96588 into main May 17, 2025
5 checks passed

vaibs-d deleted the fix/data-handling-improvements branch May 17, 2025 23:47

3 participants
