Skip to content

Conversation

@marcellodebernardi
Copy link
Contributor

This PR is a depressingly confusing mish-mash of several bugfixes, improvements, and clean up. The main changes are:

  1. The DatasetGenerator has been simplified, cleaned up, made properly async, and can now handle column-wise data augmentation (earlier only row-wise).
  2. The traces of the smolagents agents are now logged to mlflow for easy tracking
  3. Improvements to mlflow tracking, chain of thought is more informative, etc
  4. Include the EDA report in the model bundle and mlflow for convenience
  5. Miscelanneous bugfixes

Yes, this is a terrible PR 🚀

@marcellodebernardi marcellodebernardi marked this pull request as ready for review May 19, 2025 08:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors dataset generation, enhances MLflow integration, and cleans up related tooling and examples.

  • Simplified DatasetGenerator to use SimpleLLMDataGenerator with proper async and column‐wise augmentation.
  • Overhauled MLFlowCallback to auto-log smolagents, include EDA markdown artifacts, and updated its unit tests.
  • Updated project config, prompts, and utilities for EDA report handling and cleaned up obsolete code paths.

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/internal/models/callbacks/test_mlflow.py Adjusted MLflow tests to mock experiment creation and set_experiment
tests/unit/internal/datasets/core/generation/utils/test_oversampling.py Removed obsolete SMOTE oversampling tests
pyproject.toml Bumped package version; updated mlflow, torch, smolagents, transformers, tokenizers
plexe/templates/prompts/agent/schema_resolver_prompt.jinja Refined prompt language and schema instructions
plexe/templates/prompts/agent/agent_manager_prompt.jinja Removed personal name from manager prompt
plexe/models.py Added formatting and storage of EDA markdown reports
plexe/internal/models/tools/datasets.py Extended split_datasets to return dataset_sizes; strip suffixes in EDA lookup
plexe/internal/models/callbacks/mlflow.py Configured experiment creation, smolagents autolog, and EDA artifact logging
plexe/internal/datasets/generator.py Switched to SimpleLLMDataGenerator and cleaned imports
plexe/internal/datasets/core/generation/utils/oversampling.py Deleted obsolete oversampling utility
plexe/internal/datasets/core/generation/simple_llm.py Implemented async batch/column data generation
plexe/internal/datasets/core/generation/combined.py Removed CombinedDataGenerator
plexe/internal/datasets/core/generation/base.py Expanded interface to support column-only generation
plexe/internal/datasets/config.py Locked generator to "simple" and cleaned up instructions
plexe/internal/common/utils/response.py Added JSON array extraction and DataFrame conversion helpers
plexe/fileio.py Included EDA markdown in model bundle on save/load
plexe/datasets.py Enhanced DatasetGenerator for column augmentation and schema validation
plexe/agents/dataset_analyser.py Expanded authorized imports for dataset analyser agent
examples/dataset_generation.py New example for synthetic data generation
examples/dataset_augmentation.py New example for dataset augmentation
Comments suppressed due to low confidence (4)

plexe/datasets.py:96

  • Name 'DataGenerator' is not imported in this module. Add the correct import (e.g. from plexe.internal.datasets.generator import SimpleLLMDataGenerator or alias your generator) to avoid NameError.
self.data_generator = DataGenerator(self.provider, self.description, self.schema)

tests/unit/internal/models/callbacks/test_mlflow.py:183

  • The new EDA artifact logging logic in on_build_end isn't covered by this test. Consider adding a test where info.node.metadata['eda_markdown_reports'] is set and assert that mlflow.log_artifact is called.
# Call on_build_end

plexe/internal/models/callbacks/mlflow.py:104

  • Missing import for Path. Add 'from pathlib import Path' at the top of this module to avoid NameError at runtime.
report_path = Path(f"eda_report_{dataset_name}.md")

plexe/internal/common/utils/response.py:183

  • Logger is used but not defined. Add 'logger = logging.getLogger(name)' at the top of this file so the module-level logger is available.
logger.warning("JSON is a single object, converting to list")

Comment on lines +61 to +64
# Create experiment
self.experiment_id = mlflow.create_experiment(self.experiment_name)
mlflow.set_experiment(experiment_name=self.experiment_name)
logger.debug(f"✅ MLFlow configured with experiment '{self.experiment_name}' (ID: {self.experiment_id})")
Copy link

Copilot AI May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a new experiment unconditionally on initialization can lead to duplicates or errors if the experiment already exists. Use get_experiment_by_name first and only call create_experiment when it returns None.

Suggested change
# Create experiment
self.experiment_id = mlflow.create_experiment(self.experiment_name)
mlflow.set_experiment(experiment_name=self.experiment_name)
logger.debug(f"✅ MLFlow configured with experiment '{self.experiment_name}' (ID: {self.experiment_id})")
# Create or retrieve experiment
experiment = mlflow.get_experiment_by_name(self.experiment_name)
if experiment is None:
self.experiment_id = mlflow.create_experiment(self.experiment_name)
logger.debug(f"✅ Created new MLFlow experiment '{self.experiment_name}' (ID: {self.experiment_id})")
else:
self.experiment_id = experiment.experiment_id
logger.debug(f"✅ Retrieved existing MLFlow experiment '{self.experiment_name}' (ID: {self.experiment_id})")
mlflow.set_experiment(experiment_name=self.experiment_name)

Copilot uses AI. Check for mistakes.
@plexe-ai plexe-ai deleted a comment from jazzberry-ai bot May 19, 2025
@marcellodebernardi marcellodebernardi deleted the refactor/datasets-mlflow-and-more branch May 19, 2025 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants