Skip to content

Conversation

@marcellodebernardi
Copy link
Contributor

This PR makes two main changes to plexe:

  1. It adds a "data analyser" agent that carries out statistical analysis on the input dataset and generates an "EDA report", which the other agents have access to.
  2. It improves the dataset splitting tool to handle time series data correctly
  3. It makes several miscellaneous improvements to small things like file clean-up after training, COT summaries, etc

@plexe-ai plexe-ai deleted a comment from jazzberry-ai bot May 17, 2025
@marcellodebernardi marcellodebernardi marked this pull request as ready for review May 17, 2025 22:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances dataset handling by introducing an EDA agent, improving time series splits, and applying miscellaneous clean-up and prompt updates.

  • Added a new EDA agent with register/get EDA report tools and integrated it into the build pipeline
  • Updated dataset splitting (both in TabularDataset and split_datasets) to support chronological time series
  • Performed minor clean-ups: bumped version, cleaned temp files in MLflow callback, refined COT summarization prompt

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pyproject.toml Bump version to 0.20.0
plexe/templates/prompts/utils/cot_summarize.jinja Refined COT summary instructions
plexe/templates/prompts/planning/generate.jinja Removed obsolete planning generate template
plexe/templates/prompts/agent/schema_resolver_prompt.jinja Updated to use EDA tools in schema resolution
plexe/templates/prompts/agent/mls_prompt_templates.yaml Added guidance for using get_eda_report
plexe/templates/prompts/agent/eda_prompt_templates.yaml New prompts for the EDA agent
plexe/agents/dataset_analyser.py New EdaAgent implementation
plexe/models.py Integrated EdaAgent run and stored EDA metadata
plexe/internal/models/tools/datasets.py Added is_time_series split and EDA report tools
plexe/internal/common/datasets/tabular.py Extended split() to handle time series splits
plexe/internal/models/callbacks/mlflow.py Clean up temp files after logging artifacts
plexe/internal/common/utils/chain_of_thought/emitters.py Updated color mapping for new agents
plexe/internal/agents.py Registered get_eda_report for tool-calling agents
plexe/config.py Added prompt rendering for the EDA agent
docs/architecture/multi-agent-system.md Documented EDA Agent in architecture
Comments suppressed due to low confidence (7)

plexe/templates/prompts/utils/cot_summarize.jinja:1

  • [nitpick] The old '-1.' and '-2.' list items overlap with the newly added numbered instructions, which may confuse template users. Consider removing or renumbering the obsolete list to keep the instruction set clear.
  -1. A clear, professional title (3-8 words) that captures the essence of what happened

plexe/internal/models/tools/datasets.py:211

  • The '@tool' decorator is used but not imported in this file. Add the appropriate import (e.g., 'from smolagents import tool') at the top to avoid NameError.
@tool

plexe/internal/models/tools/datasets.py:239

  • The code references 'ObjectRegistry' but it is not imported here. Add 'from plexe.internal.common.registries.objects import ObjectRegistry' at the top of the file.
    object_registry = ObjectRegistry()

plexe/internal/models/tools/datasets.py:211

  • Logger methods are called (e.g., logger.debug) but 'logger' is not defined. Add 'logger = logging.getLogger(name)' after the imports.
@tool

plexe/internal/models/tools/datasets.py:264

  • The get_eda_report function also uses '@tool', ObjectRegistry, and logger without importing them. Ensure you import 'tool', 'ObjectRegistry', and define 'logger' at the top of this file.
@tool

plexe/models.py:387

  • Storing all EDA reports under the same 'eda_report' key will overwrite previous entries. Use a unique metadata key per dataset (e.g., f"eda_report_{name}") or store them in a dict/list.
                self.metadata["eda_report"] = self.object_registry.get(dict, f"eda_report_{name}")

plexe/internal/models/callbacks/mlflow.py:152

  • The 'Path' class is used but not imported. Add 'from pathlib import Path' to the imports to avoid NameError during cleanup.
                    Path("trainer_source.py").unlink(missing_ok=True)

@vaibs-d vaibs-d merged commit de96588 into main May 17, 2025
5 checks passed
@vaibs-d vaibs-d deleted the fix/data-handling-improvements branch May 17, 2025 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants