-
Notifications
You must be signed in to change notification settings - Fork 221
refactor: datasets mlflow and more #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors dataset generation, enhances MLflow integration, and cleans up related tooling and examples.
- Simplified
DatasetGeneratorto useSimpleLLMDataGeneratorwith proper async and column‐wise augmentation. - Overhauled
MLFlowCallbackto auto-log smolagents, include EDA markdown artifacts, and updated its unit tests. - Updated project config, prompts, and utilities for EDA report handling and cleaned up obsolete code paths.
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/internal/models/callbacks/test_mlflow.py | Adjusted MLflow tests to mock experiment creation and set_experiment |
| tests/unit/internal/datasets/core/generation/utils/test_oversampling.py | Removed obsolete SMOTE oversampling tests |
| pyproject.toml | Bumped package version; updated mlflow, torch, smolagents, transformers, tokenizers |
| plexe/templates/prompts/agent/schema_resolver_prompt.jinja | Refined prompt language and schema instructions |
| plexe/templates/prompts/agent/agent_manager_prompt.jinja | Removed personal name from manager prompt |
| plexe/models.py | Added formatting and storage of EDA markdown reports |
| plexe/internal/models/tools/datasets.py | Extended split_datasets to return dataset_sizes; strip suffixes in EDA lookup |
| plexe/internal/models/callbacks/mlflow.py | Configured experiment creation, smolagents autolog, and EDA artifact logging |
| plexe/internal/datasets/generator.py | Switched to SimpleLLMDataGenerator and cleaned imports |
| plexe/internal/datasets/core/generation/utils/oversampling.py | Deleted obsolete oversampling utility |
| plexe/internal/datasets/core/generation/simple_llm.py | Implemented async batch/column data generation |
| plexe/internal/datasets/core/generation/combined.py | Removed CombinedDataGenerator |
| plexe/internal/datasets/core/generation/base.py | Expanded interface to support column-only generation |
| plexe/internal/datasets/config.py | Locked generator to "simple" and cleaned up instructions |
| plexe/internal/common/utils/response.py | Added JSON array extraction and DataFrame conversion helpers |
| plexe/fileio.py | Included EDA markdown in model bundle on save/load |
| plexe/datasets.py | Enhanced DatasetGenerator for column augmentation and schema validation |
| plexe/agents/dataset_analyser.py | Expanded authorized imports for dataset analyser agent |
| examples/dataset_generation.py | New example for synthetic data generation |
| examples/dataset_augmentation.py | New example for dataset augmentation |
Comments suppressed due to low confidence (4)
plexe/datasets.py:96
- Name 'DataGenerator' is not imported in this module. Add the correct import (e.g. from plexe.internal.datasets.generator import SimpleLLMDataGenerator or alias your generator) to avoid NameError.
self.data_generator = DataGenerator(self.provider, self.description, self.schema)
tests/unit/internal/models/callbacks/test_mlflow.py:183
- The new EDA artifact logging logic in
on_build_endisn't covered by this test. Consider adding a test whereinfo.node.metadata['eda_markdown_reports']is set and assert thatmlflow.log_artifactis called.
# Call on_build_end
plexe/internal/models/callbacks/mlflow.py:104
- Missing import for Path. Add 'from pathlib import Path' at the top of this module to avoid NameError at runtime.
report_path = Path(f"eda_report_{dataset_name}.md")
plexe/internal/common/utils/response.py:183
- Logger is used but not defined. Add 'logger = logging.getLogger(name)' at the top of this file so the module-level logger is available.
logger.warning("JSON is a single object, converting to list")
| # Create experiment | ||
| self.experiment_id = mlflow.create_experiment(self.experiment_name) | ||
| mlflow.set_experiment(experiment_name=self.experiment_name) | ||
| logger.debug(f"✅ MLFlow configured with experiment '{self.experiment_name}' (ID: {self.experiment_id})") |
Copilot
AI
May 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new experiment unconditionally on initialization can lead to duplicates or errors if the experiment already exists. Use get_experiment_by_name first and only call create_experiment when it returns None.
| # Create experiment | |
| self.experiment_id = mlflow.create_experiment(self.experiment_name) | |
| mlflow.set_experiment(experiment_name=self.experiment_name) | |
| logger.debug(f"✅ MLFlow configured with experiment '{self.experiment_name}' (ID: {self.experiment_id})") | |
| # Create or retrieve experiment | |
| experiment = mlflow.get_experiment_by_name(self.experiment_name) | |
| if experiment is None: | |
| self.experiment_id = mlflow.create_experiment(self.experiment_name) | |
| logger.debug(f"✅ Created new MLFlow experiment '{self.experiment_name}' (ID: {self.experiment_id})") | |
| else: | |
| self.experiment_id = experiment.experiment_id | |
| logger.debug(f"✅ Retrieved existing MLFlow experiment '{self.experiment_name}' (ID: {self.experiment_id})") | |
| mlflow.set_experiment(experiment_name=self.experiment_name) |
This PR is a depressingly confusing mish-mash of several bugfixes, improvements, and clean up. The main changes are:
DatasetGeneratorhas been simplified, cleaned up, made properlyasync, and can now handle column-wise data augmentation (earlier only row-wise).smolagentsagents are now logged tomlflowfor easy trackingmlflowtracking, chain of thought is more informative, etcmlflowfor convenienceYes, this is a terrible PR 🚀