-
Notifications
You must be signed in to change notification settings - Fork 221
fix: dataset handling improvements #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances dataset handling by introducing an EDA agent, improving time series splits, and applying miscellaneous clean-up and prompt updates.
- Added a new EDA agent with register/get EDA report tools and integrated it into the build pipeline
- Updated dataset splitting (both in TabularDataset and split_datasets) to support chronological time series
- Performed minor clean-ups: bumped version, cleaned temp files in MLflow callback, refined COT summarization prompt
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Bump version to 0.20.0 |
| plexe/templates/prompts/utils/cot_summarize.jinja | Refined COT summary instructions |
| plexe/templates/prompts/planning/generate.jinja | Removed obsolete planning generate template |
| plexe/templates/prompts/agent/schema_resolver_prompt.jinja | Updated to use EDA tools in schema resolution |
| plexe/templates/prompts/agent/mls_prompt_templates.yaml | Added guidance for using get_eda_report |
| plexe/templates/prompts/agent/eda_prompt_templates.yaml | New prompts for the EDA agent |
| plexe/agents/dataset_analyser.py | New EdaAgent implementation |
| plexe/models.py | Integrated EdaAgent run and stored EDA metadata |
| plexe/internal/models/tools/datasets.py | Added is_time_series split and EDA report tools |
| plexe/internal/common/datasets/tabular.py | Extended split() to handle time series splits |
| plexe/internal/models/callbacks/mlflow.py | Clean up temp files after logging artifacts |
| plexe/internal/common/utils/chain_of_thought/emitters.py | Updated color mapping for new agents |
| plexe/internal/agents.py | Registered get_eda_report for tool-calling agents |
| plexe/config.py | Added prompt rendering for the EDA agent |
| docs/architecture/multi-agent-system.md | Documented EDA Agent in architecture |
Comments suppressed due to low confidence (7)
plexe/templates/prompts/utils/cot_summarize.jinja:1
- [nitpick] The old '-1.' and '-2.' list items overlap with the newly added numbered instructions, which may confuse template users. Consider removing or renumbering the obsolete list to keep the instruction set clear.
-1. A clear, professional title (3-8 words) that captures the essence of what happened
plexe/internal/models/tools/datasets.py:211
- The '@tool' decorator is used but not imported in this file. Add the appropriate import (e.g., 'from smolagents import tool') at the top to avoid NameError.
@tool
plexe/internal/models/tools/datasets.py:239
- The code references 'ObjectRegistry' but it is not imported here. Add 'from plexe.internal.common.registries.objects import ObjectRegistry' at the top of the file.
object_registry = ObjectRegistry()
plexe/internal/models/tools/datasets.py:211
- Logger methods are called (e.g., logger.debug) but 'logger' is not defined. Add 'logger = logging.getLogger(name)' after the imports.
@tool
plexe/internal/models/tools/datasets.py:264
- The get_eda_report function also uses '@tool', ObjectRegistry, and logger without importing them. Ensure you import 'tool', 'ObjectRegistry', and define 'logger' at the top of this file.
@tool
plexe/models.py:387
- Storing all EDA reports under the same 'eda_report' key will overwrite previous entries. Use a unique metadata key per dataset (e.g., f"eda_report_{name}") or store them in a dict/list.
self.metadata["eda_report"] = self.object_registry.get(dict, f"eda_report_{name}")
plexe/internal/models/callbacks/mlflow.py:152
- The 'Path' class is used but not imported. Add 'from pathlib import Path' to the imports to avoid NameError during cleanup.
Path("trainer_source.py").unlink(missing_ok=True)
This PR makes two main changes to
plexe: